# Pandas Exercise

When working on real-world data tasks, you'll quickly realize that a large portion of your time is spent manipulating raw data into a form you can actually work with, a process often called *data munging* or *data wrangling*.  Different programming languages offer different methods and packages for this task, with varying degrees of ease.  Luckily for us, Python has an excellent one called Pandas, which we will use in this exercise.

In [1]:
import pandas as pd
import numpy as np

## Merging, Indexes, and `apply()`
Much of the power of data science comes from the ability to join together datasets from very different sources.  For example, one could be interested in seeing whether there is a relationship between housing prices and the prevalence of infectious disease in a given ZIP code.  Combining two datasets on a shared key like this is often referred to as a *merge* or *join*.
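
As a rough illustration of the ZIP code idea, here is a minimal sketch of a merge. The two tables, their column names, and all values are invented for this example and are not part of the exercise data.

```python
import pandas as pd

# Hypothetical toy data: neither table comes from the exercise files.
housing = pd.DataFrame({"zip": ["10001", "10002", "10003"],
                        "median_price": [850000, 620000, 910000]})
disease = pd.DataFrame({"zip": ["10002", "10003", "10004"],
                        "cases_per_1000": [3.1, 1.4, 2.7]})

# Keep only the ZIP codes present in both tables (an inner join on the key column).
joined = pd.merge(housing, disease, on="zip", how="inner")
print(joined)
```

Only the ZIP codes that appear in both tables survive the inner join, which is why the result has fewer rows than either input.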

Every Pandas Data Frame has an *index*.  Indices in Pandas are a bit of a complex topic, but for the time being treat them as a unique identifier for each row in a Data Frame.  When performing joins and manipulating Data Frames, it is important to remember that your task may require creating or changing the Data Frame's index.  For more extensive reading on this topic, consult Tom Augspurger's [Modern Pandas post on indexes](http://tomaugspurger.github.io/modern-3-indexes.html).
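
One detail that is easy to miss when changing an index: `set_index()` returns a new Data Frame instead of modifying the one it is called on. The sketch below uses a small made-up frame to show the difference.

```python
import pandas as pd

df = pd.DataFrame({"id": [101, 102, 103], "y": [1, 2, 2]})  # made-up values

df.set_index("id")            # returns a new frame; df itself is unchanged
print(df.index.name)          # the index of df still has no name

indexed = df.set_index("id")  # keep the result to actually use the new index
print(indexed.loc[102, "y"])  # look up a row by its id label
```

Assigning the result (or passing `inplace=True`) is what makes the new index stick.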

Lastly, if you are coming from a programming background like C/C++ or Java, you are likely very accustomed to operating on arrays and lists using for loops.  Often this is how you will want to work with Data Frames in Python, but Pandas also supports a functional style of programming through the `apply()` function.  This is similar to the `apply` family of functions in R and to `map` and related functions in Lisp.  Using `apply()` can make your code more concise, more readable, and often faster than an explicit loop when performing operations on an entire Data Frame.
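
To make the contrast concrete, the sketch below computes the range (max minus min) of each column twice, once with a loop and once with `apply()`. The frame and its values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})  # made-up values

# Loop version: visit each column and collect its range (max - min).
ranges_loop = {}
for col in df.columns:
    ranges_loop[col] = df[col].max() - df[col].min()

# apply() version: the function receives each column as a Series.
ranges_apply = df.apply(lambda s: s.max() - s.min(), axis=0)
print(ranges_apply)
```

Both versions produce the same numbers; the `apply()` version simply hands each column to the function for you and returns the results as a Series.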

Using Pandas, perform the following exercises.

1. Import the previously downloaded `free1.csv` as a Data Frame named `free_data` and rename the first column to `id`.
1. Create a Data Frame named `free_sub`, consisting of the `id`, `country`, and `y` columns from `free_data`.
1. Create a new Data Frame called `ed_level`, consisting of the `id` and three categories of education levels, labeled `high`, `med`, and `low`, for ranges of your choosing.  Do this using a for loop.
1. Merge `free_sub` and `ed_level` together.  Which column should the merge be performed on?  Do this using both the `concat()` and `merge()` functions.
1. Use the `append()` function to join together `free_sub` and `ed_level`.  Are the results the same as in part (4)?  If not, how could you reproduce the result of `append()` by using `concat()` or `merge()`?
1. Use numpy to generate two lists of 100 random floats labeled `y1` and `y2`.  Now create a sequence of integers on the range 0-100 labeled `x1` and a sequence of integers on the range 50-150 labeled `x2`.  Create two Data Frames, `dat1` and `dat2`, consisting of `x1` and `y1`, and `x2` and `y2` respectively, but having labels `x, y1`, and `x, y2`.  Use `merge()` to join these two Data Frames together, on `x`, using both an inner and outer join.  What is the difference between the two joins?
1. Create a Data Frame called `scores`, consisting of only the `y` and `v_` columns from `free_data`.
1. Using one or more for loops, compute the sum and mean for each column in `scores`.
1. Using the `apply()` function, compute the sum and mean for each column in `scores`.
1. Using the `apply()` function, label each column in `scores` as either `high`, `med`, or `low` by first computing the mean for each column and assigning the categories at values of your choosing.  Do this by writing a single function you can call with `apply()`.

In [2]:
# Question 1
free_data = pd.read_csv("free1.csv")
free_data.rename(columns = {'Unnamed: 0' : 'id'}, inplace = True)
free_data.head()

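|   | id | sex | age | educ | country | y | v1 | v2 | v3 | v4 | v5 | v6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 109276 | 0.0 | 20.0 | 4.0 | Eurasia | 1 | 4 | 3 | 3 | 5 | 3 | 4 |
| 1 | 88178 | 1.0 | 25.0 | 4.0 | Oceana | 2 | 3 | 3 | 5 | 5 | 5 | 5 |
| 2 | 111063 | 1.0 | 56.0 | 2.0 | Eastasia | 2 | 3 | 2 | 4 | 5 | 5 | 4 |
| 3 | 161488 | 0.0 | 65.0 | 6.0 | Eastasia | 2 | 3 | 3 | 5 | 5 | 5 | 5 |
| 4 | 44532 | 1.0 | 50.0 | 5.0 | Oceana | 1 | 5 | 3 | 5 | 5 | 3 | 5 |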


In [3]:
# Question 2
free_sub = free_data[['id', 'country', 'y']]
free_sub.head()

|   | id | country | y |
|---|---|---|---|
| 0 | 109276 | Eurasia | 1 |
| 1 | 88178 | Oceana | 2 |
| 2 | 111063 | Eastasia | 2 |
| 3 | 161488 | Eastasia | 2 |
| 4 | 44532 | Oceana | 1 |


In [4]:
# Question 3
educ_cat = []
# Bin educ into three levels. A missing value (NaN) fails both comparisons
# and therefore lands in "high".
for educ in free_data.educ:
    if educ < 3:
        educ_cat.append("low")
    elif 3 <= educ <= 5:
        educ_cat.append("med")
    else:
        educ_cat.append("high")
ed_level = pd.DataFrame({'id' : free_data.id, 'educ_cat' : educ_cat})
ed_level.head()

|   | educ_cat | id |
|---|---|---|
| 0 | med | 109276 |
| 1 | med | 88178 |
| 2 | low | 111063 |
| 3 | high | 161488 |
| 4 | med | 44532 |
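
The exercise asks for a for loop, but the same binning can also be sketched without one. The version below uses `np.select` with the same cutoffs (below 3, 3 to 5, above 5) on a small made-up Series; it is one possible alternative, not the required solution.

```python
import numpy as np
import pandas as pd

educ = pd.Series([4.0, 2.0, 6.0, 3.0, 5.0])  # made-up values, not from free1.csv

# Conditions are checked in order; the first one that holds picks the label.
conditions = [(educ < 3).to_numpy(), (educ <= 5).to_numpy()]
educ_cat_demo = pd.Series(np.select(conditions, ["low", "med"], default="high"),
                          index=educ.index)
print(educ_cat_demo.tolist())
```

Anything that matches neither condition, including a missing value, receives the default label, which mirrors the `else` branch of the loop.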


In [5]:
# Question 4
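# Both frames keep their default RangeIndex, so concat() along axis=1 aligns
# rows on that shared index. id stays an ordinary column in each frame and
# therefore appears twice in the concatenated result.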

concat_df = pd.concat([free_sub, ed_level], axis = 1, join = 'inner')
print("Concatenate:")
print(concat_df.head())

merge_df = pd.merge(left = free_sub, right = ed_level, on = "id", how = "inner")
print("\nMerge:")
print(merge_df.head())

Concatenate:
       id   country  y educ_cat      id
0  109276   Eurasia  1      med  109276
1   88178    Oceana  2      med   88178
2  111063  Eastasia  2      low  111063
3  161488  Eastasia  2     high  161488
4   44532    Oceana  1      med   44532

Merge:
       id   country  y educ_cat
0  109276   Eurasia  1      med
1   88178    Oceana  2      med
2  111063  Eastasia  2      low
3  161488  Eastasia  2     high
4   44532    Oceana  1      med
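
The concatenated output above carries `id` twice because `concat()` aligns on the row index, not on a column. A minimal sketch of aligning on `id` instead, using two small made-up frames as stand-ins for `free_sub` and `ed_level`:

```python
import pandas as pd

# Made-up stand-ins for free_sub and ed_level; note the different row order.
left = pd.DataFrame({"id": [11, 12, 13],
                     "country": ["Eurasia", "Oceana", "Eastasia"]})
right = pd.DataFrame({"id": [13, 11, 12],
                      "educ_cat": ["low", "med", "high"]})

# With id as the index of both frames, concat(axis=1) aligns rows by id
# and the result carries id once (as the index) rather than twice.
aligned = pd.concat([left.set_index("id"), right.set_index("id")],
                    axis=1, join="inner")
print(aligned)
```

Because the rows are matched by `id`, the result stays correct even when the two frames list their rows in a different order.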


In [6]:
# Question 5
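# DataFrame.append() was deprecated in pandas 1.4 and removed in 2.0,
# so this line only runs on older versions.
append_df1 = free_sub.append(ed_level)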
print("Append:")
print(append_df1.head())

append_df2 = pd.concat([free_sub, ed_level], axis = 0)
print("\nConcatenate:")
print(append_df2.head())

print("\nReplication of this append process not possible with Merge.")

Append:
    country educ_cat      id    y
0   Eurasia      NaN  109276  1.0
1    Oceana      NaN   88178  2.0
2  Eastasia      NaN  111063  2.0
3  Eastasia      NaN  161488  2.0
4    Oceana      NaN   44532  1.0

Concatenate:
    country educ_cat      id    y
0   Eurasia      NaN  109276  1.0
1    Oceana      NaN   88178  2.0
2  Eastasia      NaN  111063  2.0
3  Eastasia      NaN  161488  2.0
4    Oceana      NaN   44532  1.0

Replication of this append process not possible with Merge.
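
On pandas 2.0 and later, where `DataFrame.append()` is no longer available, row-wise `concat()` is the way to get the same stacking behavior. A minimal sketch with two made-up frames:

```python
import pandas as pd

a = pd.DataFrame({"id": [1, 2], "y": [1, 2]})                 # made-up values
b = pd.DataFrame({"id": [1, 2], "educ_cat": ["med", "low"]})  # made-up values

# Stack rows; columns missing from one frame are filled with NaN.
stacked = pd.concat([a, b], axis=0)

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1.
stacked_reset = pd.concat([a, b], axis=0, ignore_index=True)
print(stacked_reset)
```

The NaN values appear for the same reason as in the output above: each frame lacks a column that the other one has.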


In [7]:
# Question 6
# Generate data
np.random.seed(2017)
y1 = np.random.rand(100)
y2 = np.random.rand(100)
x1 = range(100)
x2 = range(50, 150)

# Create dataframes
dat1 = pd.DataFrame({'x' : x1, 'y1' : y1})
dat2 = pd.DataFrame({'x' : x2, 'y2' : y2})
# merge() joins on the x column directly, so no index change is needed here.

# Merge data
inner_join = pd.merge(left = dat1, right = dat2, on = "x", how = "inner")
outer_join = pd.merge(left = dat1, right = dat2, on = "x", how = "outer")

# Print results
print("Head of inner join (shape = {}):".format(inner_join.shape))
print(inner_join.head())

print("\nHead of outer join (shape = {}):".format(outer_join.shape))
print(outer_join.head())

print("""\nThe inner join only retains the 50 rows with shared indices and thus results in no NaN values, unlike the
outer join, which returns all rows using NaN to mark the missing value from the extra data frame.""")

Head of inner join (shape = (50, 3)):
    x        y1        y2
0  50  0.262496  0.866014
1  51  0.950372  0.019194
2  52  0.271928  0.471329
3  53  0.187117  0.597803
4  54  0.501474  0.846562

Head of outer join (shape = (150, 3)):
   x        y1  y2
0  0  0.020960 NaN
1  1  0.767070 NaN
2  2  0.447920 NaN
3  3  0.120542 NaN
4  4  0.930773 NaN

The inner join only retains the 50 rows with shared indices and thus results in no NaN values, unlike the
outer join, which returns all rows using NaN to mark the missing value from the extra data frame.
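
To see which side each row of an outer join came from, `merge()` accepts `indicator=True`. The sketch below uses two tiny made-up frames rather than `dat1` and `dat2`.

```python
import pandas as pd

left = pd.DataFrame({"x": [0, 1, 2], "y1": [0.1, 0.2, 0.3]})  # made-up values
right = pd.DataFrame({"x": [2, 3], "y2": [0.9, 0.8]})         # made-up values

# indicator=True adds a _merge column: left_only, right_only, or both.
outer = pd.merge(left, right, on="x", how="outer", indicator=True)
print(outer["_merge"].value_counts())
```

The rows marked `both` are exactly the ones an inner join would keep.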


In [8]:
# Question 7
scores = free_data.loc[:, 'y':'v6']
scores.head()

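|   | y | v1 | v2 | v3 | v4 | v5 | v6 |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 4 | 3 | 3 | 5 | 3 | 4 |
| 1 | 2 | 3 | 3 | 5 | 5 | 5 | 5 |
| 2 | 2 | 3 | 2 | 4 | 5 | 5 | 4 |
| 3 | 2 | 3 | 3 | 5 | 5 | 5 | 5 |
| 4 | 1 | 5 | 3 | 5 | 5 | 3 | 5 |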


In [9]:
# Question 8
score_sums = []
score_means = []
for column in scores.columns:
    score_sums.append(np.sum(scores[column]))
    score_means.append(np.mean(scores[column]))

In [10]:
# Question 9
score_sums_v2 = scores.apply(np.sum, axis = 0)
score_means_v2 = scores.apply(np.mean, axis = 0)

print("Sums:")
print(score_sums_v2)

print("\nMeans:")
print(score_means_v2)

Sums:
y     1584
v1    1192
v2    1141
v3    1649
v4    1838
v5    1740
v6    1971
dtype: int64

Means:
y     3.520000
v1    2.648889
v2    2.535556
v3    3.664444
v4    4.084444
v5    3.866667
v6    4.380000
dtype: float64
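
For common reductions like these, Pandas also offers `agg()`, which can compute several statistics per column in one call. A minimal sketch on a made-up frame (the name `scores_demo` is hypothetical):

```python
import pandas as pd

scores_demo = pd.DataFrame({"y": [1, 2, 2], "v1": [4, 3, 3]})  # made-up values

# One row per statistic, one column per original column.
summary = scores_demo.agg(["sum", "mean"])
print(summary)
```

This returns a single Data Frame instead of two separate Series, which can be convenient when comparing statistics side by side.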


In [11]:
# Question 10
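def label_score(x):
    """Map a column mean to "low" (< 3), "med" (3 to 4), or "high" (> 4)."""
    if x < 3:
        return "low"
    elif 3 <= x <= 4:
        return "med"
    else:
        return "high"

# Compute each column's mean, then label the means.
scores.apply(np.mean, axis = 0).apply(label_score)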

y      med
v1     low
v2     low
v3     med
v4    high
v5     med
v6    high
dtype: object
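
The mean and the labeling can also be folded into one function, so that a single `apply()` call does both steps. The sketch below shows one way to do that; the frame, the function name `label_column`, and its `low` and `high` parameters are all made up for illustration.

```python
import pandas as pd

# Made-up scores, not taken from free1.csv.
scores_demo = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 4], "c": [5, 4, 5]})

def label_column(col, low=3, high=4):
    """Label one column by its mean; low and high are illustrative cutoffs."""
    mean = col.mean()
    if mean < low:
        return "low"
    if mean <= high:
        return "med"
    return "high"

# apply() passes each column to label_column as a Series.
labels = scores_demo.apply(label_column, axis=0)
print(labels)
```

Keeping the cutoffs as keyword arguments makes it easy to try different thresholds, for example `scores_demo.apply(label_column, low=2, high=4.5)`.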