# ASSIGNMENT 7 - Weeks 8 & 9 - Pandas

In [2]:
#In this homework assignment, you will explore and analyze a public dataset of your choosing. Since this assignment is “open-ended” in nature, you are free to expand upon the requirements below. However, you must meet the minimum requirements as indicated in each section.

#Introduction
#The World Happiness Report measures global happiness through surveys collected by the Gallup World Poll. Answers are based on the Cantril ladder life question. Participants are asked to picture a ladder, with their best possible life at a 10, the top of the ladder, and their worst possible life at a 0, the very bottom of the ladder. They are then asked to rate their own lives on this scale. The report also considers six main observed factors that may contribute to their responses: economic production, social support, life expectancy, freedom, absence of corruption, and generosity. According to the summary statistics printed in the output below, the file loaded here covers the years 2005-2020.
#Happiness records are significant in recognizing the emotional and mental wellbeing of groups of people, and can, therefore, be used to make informed decisions on a governmental and organizational level to help improve quality of life for the general population. These reports can also be used to assess international progression. With this dataset, we can look at world happiness and use a variety of factors, such as social support and finances, to evaluate their potential impact on happiness. 
#I initially selected this dataset because of my interest in psychology. I am also curious to know how places all over the world compare to one another in terms of happiness, and what the contributing factors may be. For this assignment, I would like to see where the "happiest" place on Earth is, as well as the unhappiest, and assess potential reasons or factors. I will focus on one year of the data: 2016.

#The Kaggle dataset can be found using the following link: https://www.kaggle.com/datasets/unsdsn/world-happiness

import pandas as pd
url1 = "https://raw.githubusercontent.com/rkasa01/DATA602_ASSIGNMENT7/main/world-happiness-report.csv"
df1 = pd.read_csv(url1)
print("Variables in df1:")
print(df1.columns)  # Prints variables

#I first started by importing pandas and loading the dataset from its raw GitHub link. I then printed the variable names to help better understand what we are looking at.


#Data Exploration

print("Summary statistics for df1:")
print(df1.describe()) # Summary statistics for df1
print("\nMissing value information for df1:")
print(df1.isnull().sum()) #missing value info
print("\nInformation about df1:")
df1.info()  # info() prints its report directly and returns None, so it is not wrapped in print()

#Here, we have a plethora of useful information. Using the summary statistics, we can find the mean, median, and quartiles for each of the assessed categories. For example, the mean of the life ladder is about 5.47, the median is about 5.39, the first quartile is 4.64, and the third is 6.28. The problem with this information is that it is a bit disorganized and describes all countries and years at once. At this step of the data exploration, we can also see that some columns have missing values, so we have to consider dropping rows with missing values, as well as columns which are not too useful for finding out which country is the happiest, which is the unhappiest, and what the possible contributing factors are. In the next step, I will clean the data and make it more suitable for answering my question above.
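#A minimal sketch of the same exploration steps on a tiny, made-up DataFrame, to show what each call reports. The values and the variable names toy, summary, missing_counts, and missing_share are assumptions for illustration and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not taken from the real dataset
toy = pd.DataFrame({
    "Life Ladder": [4.0, 5.0, 6.0, 7.0],
    "Generosity": [0.1, np.nan, 0.3, np.nan],
})

summary = toy.describe()             # count, mean, std, min, quartiles, max per column
missing_counts = toy.isnull().sum()  # number of missing values per column
missing_share = toy.isnull().mean()  # fraction of rows that are missing per column

print(summary.loc[["mean", "50%"]])  # the "50%" row is the median
print(missing_counts)
print(missing_share)
```

#Note that the count row of describe() only counts non-missing values, so comparing it with the number of rows is another quick way to spot gaps.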


#Data Wrangling

# World Happiness
print("World Happiness Records of 2016:")
df1_2016 = df1[df1['year'] == 2016]  # Filter only for 2016, subset
df1_2016 = df1_2016.drop(columns=['year'])  # Remove the 'year' column
df1_2016 = df1_2016.sort_values(by='Life Ladder', ascending=False)  # Sort by life ladder in descending order
df1_2016 = df1_2016.drop(columns=['Generosity'])  # Drop the 'generosity' column from df1_2016
# Rename columns
df1_2016 = df1_2016.rename(columns={
    'Log GDP per capita': 'Per Capita Income',
    'Freedom to make life choices': 'Freedom of Choice',
    'Perceptions of corruption': 'Perceived Corruption',
    'Healthy life expectancy at birth': 'Healthy Life Expectancy'
})
# Calculate 'pos:neg affect' ratio
df1_2016['pos:neg affect'] = df1_2016['Positive affect'] / df1_2016['Negative affect']
df1_2016 = df1_2016.drop(columns=['Positive affect', 'Negative affect'])

df1_2016 = df1_2016.dropna(how='any')  # Drop rows with any missing values in any of the columns
df1_2016.reset_index(drop=True, inplace=True)
print(df1_2016)

#Here I filtered the dataset to work with a single year, 2016. I renamed some of the columns and then created a ratio to assess positive and negative affect relative to each other. The higher this ratio, the more positively the general population of a country feels compared to how negatively it feels. I sorted the life ladder scores in descending order, so that the highest score appears first and the lowest appears last. It looks like the happiest place in 2016 was Finland, with a mean life ladder score of 7.660. The healthy life expectancy here was 71.7. In this assignment, I will go on to see which factors are most strongly associated with the life ladder score and, therefore, general happiness.
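#The wrangling steps described above can be sketched on a few made-up rows. The column names match the dataset, but the countries "A", "B", "C" and all values are assumptions chosen so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that reuse the dataset's column names
toy = pd.DataFrame({
    "Country name": ["A", "A", "B", "C"],
    "year": [2015, 2016, 2016, 2016],
    "Life Ladder": [5.0, 5.5, 7.0, 4.0],
    "Positive affect": [0.6, 0.75, 0.5, np.nan],
    "Negative affect": [0.3, 0.25, 0.25, 0.4],
})

# Keep one year, then drop the now-constant 'year' column
toy_2016 = toy[toy["year"] == 2016].drop(columns=["year"])
# Ratio of positive to negative affect (NaN wherever either input is missing)
toy_2016["pos:neg affect"] = toy_2016["Positive affect"] / toy_2016["Negative affect"]
toy_2016 = toy_2016.drop(columns=["Positive affect", "Negative affect"])
# Country C has no ratio, so its row is removed here
toy_2016 = toy_2016.dropna(how="any")
# Happiest first, with a clean 0..n-1 index
toy_2016 = toy_2016.sort_values(by="Life Ladder", ascending=False).reset_index(drop=True)
print(toy_2016)
```

#In this toy example, country B ranks first on the life ladder even though country A has the higher affect ratio, which shows that the two measures do not have to agree.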

# World Unhappiness
print("World Unhappiness Records 2016:")
df1_2016_ascending = df1_2016.copy()
df1_2016_ascending = df1_2016_ascending.sort_values(by='Life Ladder', ascending=True)
print(df1_2016_ascending)
#Since we printed a World Happiness dataframe earlier, I added a World Unhappiness dataframe so that we may compare the two. It looks like the unhappiest place in 2016 was the Central African Republic, with a mean life ladder score of 2.693. The healthy life expectancy here is 44.9, which is relatively low. I then wanted to do some calculations to see which factors are most strongly associated with the life ladder score and, therefore, general happiness.

# Calculate factors associated with happiness
columns_to_consider = ['Life Ladder', 'Per Capita Income', 'Social support', 'Healthy Life Expectancy', 'Freedom of Choice', 'Perceived Corruption', 'pos:neg affect']
df_subset = df1_2016[columns_to_consider]
correlation_matrix = df_subset.corr()
life_ladder_correlation = correlation_matrix['Life Ladder']

print("Factors negatively associated with happiness:")
factors_with_negative_correlation = life_ladder_correlation[life_ladder_correlation < 0]
factors_sorted_by_correlation = factors_with_negative_correlation.sort_values(ascending=True)
print(factors_sorted_by_correlation)

print("Factors positively associated with happiness:")
factors_with_positive_correlation = life_ladder_correlation[life_ladder_correlation > 0]
factors_sorted_by_correlation = factors_with_positive_correlation.sort_values(ascending=False)
print(factors_sorted_by_correlation)

#Here, I calculated the correlation of each observed factor with the life ladder score. I found that the only negatively correlated variable was perceived corruption. With a value of about -.43, it has a moderate negative association with happiness: countries where corruption is perceived to be higher tend to report lower life ladder scores. The sign gives the direction of the relationship and the absolute value gives its strength, so a negative value does not mean that the factor is unrelated to happiness. Conversely, per capita income had a correlation of about .83 with happiness, healthy life expectancy was about .80, social support was about .74, freedom of choice was about .51, and the positive:negative affect ratio was about .50. It seems like finances and health are the two factors most strongly associated with happiness, whereas the rest are also associated but to a lesser extent.
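#To illustrate how the sign and the size of a correlation are read, here is a small sketch on made-up numbers. The values are assumptions for illustration only. Ranking by absolute value is one possible way to compare strength regardless of direction, not the method used in the cell above.

```python
import pandas as pd

# Hypothetical toy values: income rises with the ladder score, corruption falls
toy = pd.DataFrame({
    "Life Ladder": [3.0, 4.0, 5.0, 6.0, 7.0],
    "Per Capita Income": [7.0, 8.5, 9.0, 10.0, 11.0],
    "Perceived Corruption": [0.9, 0.8, 0.85, 0.6, 0.3],
})

# Pearson correlation of every factor with the ladder score (dropping the trivial self-correlation)
corr_with_ladder = toy.corr()["Life Ladder"].drop("Life Ladder")

# Rank by absolute value so that strong negative relationships are not overlooked
order = corr_with_ladder.abs().sort_values(ascending=False).index
ranked = corr_with_ladder.reindex(order)
print(ranked)
```

#Here the corruption column has a negative coefficient, yet its absolute value is large, so it is still strongly related to the ladder score in this toy data.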

#I used the two most strongly correlated factors, per capita income and healthy life expectancy, in the aggregation and grouping steps that follow.

is_life_ladder_numeric = pd.to_numeric(df1_2016['Life Ladder'], errors='coerce').notna().all() #is this column numeric?
print("True or False: Is the column 'Life Ladder' numeric?", is_life_ladder_numeric)
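#As a point of comparison, a column's dtype can also be checked directly with pandas.api.types.is_numeric_dtype. The sketch below uses made-up Series to show how the two checks can differ: text that merely looks like numbers passes the to_numeric check but is not stored as a numeric dtype.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy columns: real floats versus numbers stored as text
numbers = pd.Series([5.5, 7.0, 4.0], name="Life Ladder")
numeric_text = pd.Series(["5.5", "7.0", "4.0"], name="Life Ladder")

dtype_check_numbers = is_numeric_dtype(numbers)    # checks how the column is stored
dtype_check_text = is_numeric_dtype(numeric_text)  # text columns are not numeric dtypes
# The conversion-based check only asks whether every value CAN be parsed as a number
coerce_check_text = pd.to_numeric(numeric_text, errors="coerce").notna().all()

print(dtype_check_numbers, dtype_check_text, coerce_check_text)
```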

#Because it looks like per capita income has the strongest correlation with happiness:
#Mean, minimum, and maximum 'Per Capita Income' for all countries
global_per_capita_income = df1_2016['Per Capita Income'].agg(['mean', 'min', 'max']) #aggregated
print("Global Per Capita Income Statistics:")
print(global_per_capita_income)



#Adding the second highest factor: Healthy Life Expectancy as the second column
#Each (country, life expectancy) group holds a single 2016 row, so sorting inside the groups would change nothing; the sort is applied across the grouped result instead
grouped = df1_2016.groupby(['Country name', 'Healthy Life Expectancy'])['Per Capita Income'].mean() #grouped by two factors
sorted_results = grouped.sort_values(ascending=True).reset_index() #sorted by income across all countries
print(sorted_results)
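#Grouping is most informative when each group holds several rows. As a sketch, the made-up rows below are grouped by year and the life ladder score is aggregated within each year. The countries "A" and "B" and all values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy rows: two countries observed in two years
toy = pd.DataFrame({
    "Country name": ["A", "B", "A", "B"],
    "year": [2015, 2015, 2016, 2016],
    "Life Ladder": [5.0, 7.0, 5.5, 6.5],
})

# One row per year, one column per aggregate
yearly = toy.groupby("year")["Life Ladder"].agg(["mean", "min", "max"])
print(yearly)
```

#In this toy data the yearly mean stays the same while the gap between the minimum and maximum narrows, which a single global aggregate would hide.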


#Conclusions
#After exploring this data, I can conclude that the three factors most strongly correlated with happiness on a global scale in 2016 are income, healthy life expectancy at birth, and social support. With more time, I would like to compute the correlation of each key factor with happiness within each country. This way, we would be able to see how each country compares to the global values.
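#A rough sketch of how that follow-up could look, assuming several years of data per country. The rows below are made up, and the helper name ladder_income_corr is hypothetical. With only a handful of years per country, such correlations would be noisy, so this is only a starting point.

```python
import pandas as pd

# Hypothetical toy rows: two countries observed over four years
toy = pd.DataFrame({
    "Country name": ["A"] * 4 + ["B"] * 4,
    "year": [2013, 2014, 2015, 2016] * 2,
    "Life Ladder": [5.0, 5.2, 5.4, 5.6, 6.0, 5.8, 5.9, 5.5],
    "Log GDP per capita": [9.0, 9.1, 9.2, 9.3, 10.0, 10.1, 10.2, 10.3],
})

def ladder_income_corr(group):
    """Pearson correlation between the ladder score and income within one group."""
    return group["Life Ladder"].corr(group["Log GDP per capita"])

# One correlation per country, computed over that country's years
per_country = pd.Series(
    {name: ladder_income_corr(group) for name, group in toy.groupby("Country name")}
)
# The same correlation computed over all rows at once, for comparison
global_corr = ladder_income_corr(toy)

print(per_country)
print(global_corr)
```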

Variables in df1:
Index(['Country name', 'year', 'Life Ladder', 'Log GDP per capita',
       'Social support', 'Healthy life expectancy at birth',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption', 'Positive affect', 'Negative affect'],
      dtype='object')
Summary statistics for df1:
              year  Life Ladder  ...  Positive affect  Negative affect
count  1949.000000  1949.000000  ...      1927.000000      1933.000000
mean   2013.216008     5.466705  ...         0.710003         0.268544
std       4.166828     1.115711  ...         0.107100         0.085168
min    2005.000000     2.375000  ...         0.322000         0.083000
25%    2010.000000     4.640000  ...         0.625500         0.206000
50%    2013.000000     5.386000  ...         0.722000         0.258000
75%    2017.000000     6.283000  ...         0.799000         0.320000
max    2020.000000     8.019000  ...         0.944000         0.705000

[8 rows x 10 columns]

Missing valu