# Pandas Exercise

When working on real-world data tasks, you'll quickly realize that a large portion of your time is spent manipulating raw data into a form that you can actually work with, a process often called *data munging* or *data wrangling*.  Different programming languages offer different methods and packages for this task, with varying degrees of ease. Luckily for us, Python has an excellent one called Pandas, which we will be using in this exercise.

In [1]:
import pandas as pd

## Sorting, Filtering, and Grouping data
Most of the time, we'll want to rearrange the data a bit, include only certain values in our analysis, or put the data into useful groups.  Pandas provides syntax and many functions to do this.
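
As a minimal sketch of those three operations, the toy Data Frame below (hypothetical values, not taken from `free1.csv`) is sorted, filtered, and grouped. The column names `country`, `age`, and `score` are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from free1.csv
toy = pd.DataFrame({
    "country": ["A", "B", "A", "B"],
    "age": [30, 22, 45, 51],
    "score": [3, 5, 4, 2],
})

# Sorting: order rows by age, oldest first
by_age_desc = toy.sort_values("age", ascending=False)

# Filtering: keep only the rows where a boolean condition holds
over_25 = toy[toy["age"] > 25]

# Grouping: mean score per country
mean_score = toy.groupby("country")["score"].mean()
print(mean_score)
```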

Using only Pandas, do the following exercises.

1. Using the `free1.csv` downloaded above, import it as a Data Frame named `free_data`, rename the first column to `id`, and print the first few rows.
1. Sort `free_data` by `country`, `educ`, and then by `age` in descending order, modifying the original Data Frame.
1. Create a new Data Frame called `uni` containing only rows from `free_data` which indicate that the person attended university or graduate school.  Print the value counts for each country.
1. Create a list of three Data Frames for those who are less than 25 years old, between 25 and 50 years old, and older than 50.
1. Using a for loop, create a list of 3 Data Frames each containing only one of the 3 countries.
1. Create a list of age categories, labeled 0, 1, and 2, with one entry per row, for the three groups made in part (4).  Attach this list to the `free_data` dataframe as a column named `age_cat`.
1. Print the mean for all columns for each `age_cat` using `groupby`.
1. Print the mean education for each `age_cat` using `groupby`.
1. Print summary statistics for each column for those with an education greater than or equal to 5, grouped by `age_cat`.
1. Which of the vignettes has the largest mean score for each education level?  What about the median?
1. Which country would you say has the most freedom of speech?  Be sure to justify your answer quantitatively.
1. Is there a difference of opinion between men and women regarding freedom of speech?  If so, does this difference manifest itself across the different countries?  Across education levels?  Be sure to justify your answers quantitatively.

In [2]:
# Question 1
free_data = pd.read_csv("free1.csv")
free_data.rename(columns = {'Unnamed: 0' : 'id'}, inplace = True)
free_data.head()

,id,sex,age,educ,country,y,v1,v2,v3,v4,v5,v6
0,109276,0.0,20.0,4.0,Eurasia,1,4,3,3,5,3,4
1,88178,1.0,25.0,4.0,Oceana,2,3,3,5,5,5,5
2,111063,1.0,56.0,2.0,Eastasia,2,3,2,4,5,5,4
3,161488,0.0,65.0,6.0,Eastasia,2,3,3,5,5,5,5
4,44532,1.0,50.0,5.0,Oceana,1,5,3,5,5,3,5


In [3]:
# Question 2
# A single ascending=False sorts all three keys in descending order
free_data.sort_values(["country", "educ", "age"], ascending=False, inplace=True)
free_data.head()

,id,sex,age,educ,country,y,v1,v2,v3,v4,v5,v6
62,30485,0.0,68.0,7.0,Oceana,2,3,4,3,4,5,5
34,25441,0.0,42.0,7.0,Oceana,2,4,4,4,4,4,5
401,26614,0.0,33.0,7.0,Oceana,5,3,5,4,5,4,3
151,88856,0.0,30.0,7.0,Oceana,3,2,2,5,4,4,5
115,24643,0.0,59.0,6.0,Oceana,3,3,2,4,4,3,5
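
Question 2 can also be read as asking for only `age` to be in descending order. Under that reading, `sort_values` accepts one `ascending` flag per sort key. The sketch below uses a hypothetical toy frame with the same column names rather than the real data.

```python
import pandas as pd

# Hypothetical toy frame using the same column names as free_data
toy = pd.DataFrame({
    "country": ["Oceana", "Eurasia", "Eurasia", "Oceana"],
    "educ": [4.0, 2.0, 2.0, 4.0],
    "age": [30.0, 25.0, 61.0, 48.0],
})

# One flag per sort key: country and educ ascending, age descending
mixed = toy.sort_values(["country", "educ", "age"], ascending=[True, True, False])
print(mixed)
```

Within each `country` and `educ` combination, the oldest respondent comes first, while the countries themselves stay in alphabetical order.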


In [4]:
# Question 3 (Attended university = 5, graduated university = 6, graduated graduate school = 7)
uni = free_data[free_data.educ >= 5]
print("Number who attended university: {}".format(uni.shape[0]))
print("\n" + "Number who attended university per country:")
uni.country.value_counts()

Number who attended university: 81

Number who attended university per country:


Eastasia    33
Eurasia     27
Oceana      21
Name: country, dtype: int64

In [5]:
# Question 4
by_age = []
by_age.append(free_data[free_data.age < 25])
by_age.append(free_data[(free_data.age >= 25) & (free_data.age <= 50)])
by_age.append(free_data[free_data.age > 50])

print("Number less than 25 years old: {}".format(by_age[0].shape[0]))
print("Number between 25 and 50 years old: {}".format(by_age[1].shape[0]))
print("Number older than 50: {}".format(by_age[2].shape[0]))

Number less than 25 years old: 87
Number between 25 and 50 years old: 236
Number older than 50: 123


In [6]:
# Question 5
countries = []
for my_country in free_data.country.unique():
    countries.append(free_data[free_data.country == my_country])
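
An alternative sketch for the same split: iterating over a `groupby` object yields one `(key, sub-frame)` pair per distinct value, so the per-country frames can be collected without filtering by hand. The toy frame and the `by_country` name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame, not the real free_data
toy = pd.DataFrame({
    "country": ["Oceana", "Eurasia", "Eastasia", "Oceana"],
    "y": [2, 4, 3, 1],
})

# Iterating a groupby yields (key, sub-frame) pairs, one per distinct country
by_country = {name: group for name, group in toy.groupby("country")}
print(sorted(by_country))
print(by_country["Oceana"])
```

A dictionary keyed by country name keeps track of which frame belongs to which country, which a plain list does not.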

In [7]:
# Question 6
age_cat = []
for person in free_data.age:
    if person < 25:
        age_cat.append(0)
    elif person >= 25 and person <= 50:
        age_cat.append(1)
    else:
        age_cat.append(2)
free_data["age_cat"] = age_cat
free_data.head()

,id,sex,age,educ,country,y,v1,v2,v3,v4,v5,v6,age_cat
62,30485,0.0,68.0,7.0,Oceana,2,3,4,3,4,5,5,2
34,25441,0.0,42.0,7.0,Oceana,2,4,4,4,4,4,5,1
401,26614,0.0,33.0,7.0,Oceana,5,3,5,4,5,4,3,1
151,88856,0.0,30.0,7.0,Oceana,3,2,2,5,4,4,5,1
115,24643,0.0,59.0,6.0,Oceana,3,3,2,4,4,3,5,2
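
The same labeling can be sketched without a Python loop. The version below uses `numpy.select`, which is a swapped-in technique rather than the approach used in the cell above, and a hypothetical toy frame with ages chosen to sit on the category boundaries.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on the category boundaries
toy = pd.DataFrame({"age": [19.0, 25.0, 50.0, 51.0, 73.0]})

# Conditions are checked in order; the first one that holds picks the label.
# Rows matching neither condition (age > 50) fall through to the default, 2.
conditions = [toy["age"] < 25, toy["age"] <= 50]
toy["age_cat"] = np.select(conditions, [0, 1], default=2)
print(toy)
```

As in the loop, ages of exactly 25 and exactly 50 both land in category 1.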


In [8]:
# Question 7
grouped_ages = free_data.groupby("age_cat")
grouped_ages.mean(numeric_only=True)  # skip the non-numeric country column

age_cat,id,sex,age,educ,y,v1,v2,v3,v4,v5,v6
0,97924.298851,0.534884,20.505747,3.317647,3.643678,2.517241,2.597701,3.655172,3.954023,3.781609,4.471264
1,92976.567797,0.559322,36.686441,3.034335,3.605932,2.631356,2.474576,3.580508,4.055085,3.830508,4.266949
2,81397.889764,0.566929,62.845528,2.519685,3.275591,2.771654,2.606299,3.826772,4.228346,3.992126,4.527559


In [9]:
# Question 8
print("Mean education by age group:")
grouped_ages.educ.mean()

Mean education by age group:


age_cat
0    3.317647
1    3.034335
2    2.519685
Name: educ, dtype: float64

In [52]:
# Question 9
grouped_uni_ages = free_data[free_data.educ >= 5.0].groupby("age_cat")
grouped_uni_ages.describe().stack()

age_cat,,age,educ,id,sex,v1,v2,v3,v4,v5,v6,y
0,count,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0
0,mean,21.75,5.2,97745.45,0.5,2.8,3.1,3.75,4.2,4.0,4.65,3.15
0,std,1.585294,0.410391,35834.63666,0.512989,1.151658,1.48324,0.786398,0.695852,0.858395,0.587143,1.386969
0,min,19.0,5.0,31052.0,0.0,1.0,1.0,3.0,3.0,3.0,3.0,1.0
0,25%,20.0,5.0,90588.25,0.0,2.0,2.0,3.0,4.0,3.0,4.0,2.0
0,50%,22.0,5.0,108682.0,0.5,3.0,3.0,4.0,4.0,4.0,5.0,3.0
0,75%,23.0,5.0,116392.25,1.0,4.0,4.25,4.0,5.0,5.0,5.0,4.0
0,max,24.0,6.0,171662.0,1.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0
1,count,43.0,43.0,43.0,43.0,43.0,43.0,43.0,43.0,43.0,43.0,43.0
1,mean,36.767442,5.488372,89611.930233,0.488372,2.767442,2.395349,3.581395,4.093023,3.72093,4.162791,3.581395


In [24]:
# Question 10
# Create aggregations
grouped_educs = free_data[['educ', 'v1', 'v2', 'v3', 'v4', 'v5', 'v6']].groupby("educ")
educ_means = grouped_educs.mean()
educ_medians = grouped_educs.median()

# Loop through education levels to find maximum means and medians
max_means = []
max_medians = []
for level in range(len(educ_means)):
    max_means.append(educ_means.iloc[level].idxmax())
    max_medians.append(educ_medians.iloc[level].idxmax())

# Print results
print("Level\tMaximum mean\tMaximum median")
for level in range(len(educ_means)):
    print("{}\t{}\t\t{}".format(educ_means.index[level], max_means[level], max_medians[level])) 

Level	Maximum mean	Maximum median
1.0	v6		v6
2.0	v6		v6
3.0	v6		v6
4.0	v6		v6
5.0	v6		v6
6.0	v6		v6
7.0	v4		v4
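
The per-level lookup can also be sketched without an explicit loop: `idxmax(axis=1)` returns, for each row, the label of the column holding the row maximum. The toy values below are hypothetical and were chosen so that the mean and the median disagree for one education level.

```python
import pandas as pd

# Hypothetical toy data with two vignette columns
toy = pd.DataFrame({
    "educ": [1.0, 1.0, 2.0, 2.0, 2.0],
    "v1": [5, 3, 1, 2, 9],
    "v2": [2, 2, 3, 3, 3],
})

grouped = toy.groupby("educ")

# For each education level, the vignette with the largest mean and largest median
summary = pd.DataFrame({
    "max_mean": grouped.mean().idxmax(axis=1),
    "max_median": grouped.median().idxmax(axis=1),
})
print(summary)
```

For level 2.0 in this toy data, one large `v1` value pulls the mean above `v2` while the median stays below it, so the two columns name different vignettes.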


In [59]:
# Question 11
# 1 = Completely free to 5 = Not free at all
grouped_countries = free_data.groupby("country")
print(grouped_countries['y'].describe())
print("""Oceana has the most self-reported freedom of speech, as it has both the lowest mean and the lowest median 'y'
score, while the standard deviation isn't large enough to overcome that difference.""")

          count      mean       std  min  25%  50%  75%  max
country                                                     
Eastasia  150.0  3.660000  1.163367  1.0  3.0  4.0  5.0  5.0
Eurasia   150.0  4.013333  1.152684  1.0  3.0  4.0  5.0  5.0
Oceana    150.0  2.886667  1.303257  1.0  2.0  3.0  4.0  5.0
Oceana has the most self-reported freedom of speech, as it has both the lowest mean and the lowest median 'y'
score, while the standard deviation isn't large enough to overcome that difference.
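
One rough way to make the standard-deviation argument concrete is to express the gap between two country means in units of its standard error. The sketch below copies the `count`, `mean`, and `std` values for Oceana and Eastasia from the table above. It assumes the respondents are independent and that treating the 1 to 5 scale as numeric is acceptable, and the helper name `standard_error` is hypothetical.

```python
import math

# Summary values copied from the describe() output above (count, mean, std of 'y')
oceana = {"n": 150, "mean": 2.886667, "std": 1.303257}
eastasia = {"n": 150, "mean": 3.660000, "std": 1.163367}

def standard_error(group):
    """Standard error of the mean: std / sqrt(n)."""
    return group["std"] / math.sqrt(group["n"])

# Gap between the two means, measured in units of its combined standard error
diff = eastasia["mean"] - oceana["mean"]
se_diff = math.sqrt(standard_error(oceana) ** 2 + standard_error(eastasia) ** 2)
z = diff / se_diff
print(round(z, 2))
```

A ratio well above 2 suggests the gap is unlikely to be sampling noise alone, though this is only an approximation and not a substitute for a proper test on the raw data.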


In [56]:
# Question 12
# 1 = man, 0 = woman
grouped_genders = free_data.groupby("sex")
print(grouped_genders['y'].describe())
print("""The sexes have reasonably even self-reported freedom of speech scores. The medians are equal, while men
have a slightly higher mean score, which on this scale means slightly less reported freedom.\n""")

grouped_gender_country = free_data.groupby(["sex", "country"])
print(grouped_gender_country['y'].describe())
print("""When broken out by country, the gender profile has the same overall breakdown as in total. Median scores are
equivalent between men and women, while men have slightly higher mean scores (slightly less reported freedom) in all
countries, particularly Eastasia.\n""")

grouped_gender_educ = free_data.groupby(["sex", "educ"])
print(grouped_gender_educ['y'].describe())
print("""The gender breakdowns are most skewed when examined in light of educational backgrounds. Those who had
completed primary or secondary school (3 or 4) had essentially equal freedom of speech reports between men and women,
and those who had completed less than primary school education (2) actually reported more freedom of speech in women
than men. For the group who had no formal education (1), the trend followed the total gender breakdown of equal medians
and slightly higher mean scores for men. However, for all groups with a high school diploma and above, men reported
higher levels of freedom of speech than women.\n""")

     count      mean       std  min  25%  50%  75%  max
sex                                                    
0.0  199.0  3.417085  1.337837  1.0  3.0  4.0  5.0  5.0
1.0  250.0  3.600000  1.257907  1.0  3.0  4.0  5.0  5.0
The sexes have reasonably even self-reported freedom of speech scores. The medians are equal, while men
have a slightly higher mean score, which on this scale means slightly less reported freedom.

              count      mean       std  min  25%  50%   75%  max
sex country                                                      
0.0 Eastasia   58.0  3.500000  1.260047  1.0  3.0  4.0  4.75  5.0
    Eurasia    66.0  4.000000  1.123182  1.0  3.0  4.0  5.00  5.0
    Oceana     75.0  2.840000  1.346065  1.0  2.0  3.0  4.00  5.0
1.0 Eastasia   92.0  3.760870  1.093131  1.0  3.0  4.0  5.00  5.0
    Eurasia    84.0  4.023810  1.181955  1.0  3.0  4.0  5.00  5.0
    Oceana     74.0  2.918919  1.268769  1.0  2.0  3.0  4.00  5.0
When broken out by country, the gender profile has the same overall breakdown as in total. Median sc