# Pandas Exercise

When working on real-world data tasks, you'll quickly realize that a large portion of your time is spent manipulating raw data into a form you can actually work with, a process often called *data munging* or *data wrangling*.  Different programming languages offer different methods and packages for this task, with varying degrees of ease.  Luckily for us, Python has an excellent one called Pandas, which we will use in this exercise.

In [1]:
import pandas as pd

## Importing data and working with Data Frames
The Data Frame is perhaps the most important object in Pandas and Data Science in Python, providing a plethora of functions for common data tasks.  Using only Pandas, do the following exercises.

1. Download [free1.csv](https://vincentarelbundock.github.io/Rdatasets/csv/Zelig/free1.csv) from the [R Data Repository](https://vincentarelbundock.github.io/Rdatasets/datasets.html) and save it to the same directory as this notebook.  Then import it into your environment as a Data Frame.  Now read [free2.csv](https://vincentarelbundock.github.io/Rdatasets/csv/Zelig/free2.csv) directly into a Data Frame from the URL.
1. Combine your `free1` Data Frame with `free2` into a single Data Frame, named `free_data`, and print the first few rows to verify that it worked correctly.  From here on out, this combined Data Frame is what we will be working with.
1. Print the last 10 rows.
1. Rename the first column (currently unnamed) to `id`.  Print the column names to verify that it worked correctly.
1. How many rows and columns does the Data Frame have?
1. What is the data type of each column?  Can quantities like the mean be calculated for every column?  If not, for which one(s) is this impossible, and why?
1. Print out the first 5 rows of the `country` column.
1. How many unique values are in the `country` column?
1. Print out the number of occurrences of each unique value in the `country` column.
1. Summarize the Data Frame.
1. Were all columns included in the summary?  If not, print the summary again, forcing the missing column(s) to appear in the result.
1. Print rows 100 to 110 of the `free1` Data Frame.
1. Print rows 100 to 110 of only the first 3 columns in `free1` using only indices.
1. Create and print a list containing the mean and the value counts of each column in the Data Frame **except** the `country` column.
1. Create a Data Frame called `demographics` using only the columns `sex`, `age`, and `educ` from the `free1` Data Frame.  Also create a Data Frame called `scores` using only the columns `v1`, `v2`, `v3`, `v4`, `v5`, and `v6` from the `free1` Data Frame.
1. Loop through each row in `scores` and find the largest value among the `v_` columns in that row.  Store your results in two lists, one holding the value and one holding the name of the column it came from.  For example, row `0` is
```python
{'v1': 4, 'v2': 3, 'v3': 3, 'v4': 5, 'v5': 3, 'v6': 4}
```
so the values
```python
('v4', 5)
```
should be added to your two lists.
1. Create a new Data Frame from your results in part (16), with columns named `cat` (the column holding the largest score) and `score` (the score itself).
1. Using the Data Frame created in part (17), print how often each column holds the maximum score.

In [2]:
# Question 1
free1 = pd.read_csv("free1.csv")
free2 = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/Zelig/free2.csv")

In [3]:
# Question 2
free_data = pd.concat((free1, free2), axis = 0)
free_data.head()

,Unnamed: 0,sex,age,educ,country,y,v1,v2,v3,v4,v5,v6
0,109276,0.0,20.0,4.0,Eurasia,1,4,3,3,5,3,4
1,88178,1.0,25.0,4.0,Oceana,2,3,3,5,5,5,5
2,111063,1.0,56.0,2.0,Eastasia,2,3,2,4,5,5,4
3,161488,0.0,65.0,6.0,Eastasia,2,3,3,5,5,5,5
4,44532,1.0,50.0,5.0,Oceana,1,5,3,5,5,3,5
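
By default, `pd.concat` keeps each frame's original row labels, so the combined frame can contain repeated labels. That is why the last rows of the 900-row `free_data` carry labels such as 449 rather than 899. The sketch below illustrates the difference that `ignore_index=True` makes; `part_a` and `part_b` are hypothetical toy frames, not the real data.

```python
import pandas as pd

# Toy stand-ins for free1 and free2 (hypothetical values, not the real data).
part_a = pd.DataFrame({"country": ["Eurasia", "Oceana"], "y": [1, 2]})
part_b = pd.DataFrame({"country": ["Eastasia", "Oceana"], "y": [5, 4]})

# Default behaviour: each frame keeps its own row labels, so labels repeat.
stacked = pd.concat([part_a, part_b], axis=0)
print(stacked.index.tolist())     # [0, 1, 0, 1]

# ignore_index=True discards the old labels and renumbers the rows 0..n-1.
renumbered = pd.concat([part_a, part_b], axis=0, ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Whether to renumber is a design choice: repeated labels are harmless for position-based access with `iloc`, but label-based lookups with `loc` would then return more than one row.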


In [4]:
# Question 3
free_data.tail(10)

,Unnamed: 0,sex,age,educ,country,y,v1,v2,v3,v4,v5,v6
440,56676,1.0,42.0,1.0,Oceana,5,4,1,3,3,2,4
441,58098,0.0,41.0,1.0,Oceana,5,3,4,3,3,4,4
442,117252,1.0,41.0,6.0,Eurasia,5,4,2,1,4,4,3
443,110212,0.0,40.0,3.0,Eurasia,4,3,2,3,4,3,3
444,168326,0.0,24.0,4.0,Eastasia,5,3,4,3,3,3,4
445,95744,1.0,70.0,1.0,Eastasia,3,2,1,1,2,1,1
446,109491,1.0,18.0,4.0,Eurasia,3,1,1,1,1,1,2
447,65788,1.0,19.0,1.0,Eastasia,5,3,3,3,3,3,3
448,147766,0.0,53.0,4.0,Eastasia,4,3,3,3,3,3,3
449,116952,1.0,18.0,3.0,Eurasia,5,4,4,4,4,4,4


In [5]:
# Question 4
free_data.rename(columns = {'Unnamed: 0' : 'id'}, inplace = True)
print(free_data.columns)

Index(['id', 'sex', 'age', 'educ', 'country', 'y', 'v1', 'v2', 'v3', 'v4',
       'v5', 'v6'],
      dtype='object')
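
As an alternative to `inplace = True`, `rename` can also return a new Data Frame and leave the original untouched, which some people find easier to reason about. A minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical two-row frame with the same unnamed first column.
toy = pd.DataFrame({"Unnamed: 0": [101, 102], "sex": [0.0, 1.0]})

# Without inplace=True, rename returns a renamed copy.
renamed = toy.rename(columns={"Unnamed: 0": "id"})

print(list(renamed.columns))  # ['id', 'sex']
print(list(toy.columns))      # ['Unnamed: 0', 'sex']  (original unchanged)
```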


In [6]:
# Question 5
nrows, ncols = free_data.shape
print("Number of rows = {}".format(nrows))
print("Number of columns = {}".format(ncols))

Number of rows = 900
Number of columns = 12


In [7]:
# Question 6
print(free_data.dtypes)
print("Properties like the mean can be calculated for every column except country, because its values are strings (object dtype) and have no mean")

id           int64
sex        float64
age        float64
educ       float64
country     object
y            int64
v1           int64
v2           int64
v3           int64
v4           int64
v5           int64
v6           int64
dtype: object
Properties like the mean can be calculated for every column except country, because its values are strings (object dtype) and have no mean
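
One way to check this programmatically, rather than by reading the dtype listing, is to ask Pandas which columns are numeric. The sketch below uses a hypothetical three-row frame called `toy`; `select_dtypes` and the `numeric_only` argument are standard Pandas features, but the data is invented.

```python
import pandas as pd

# Hypothetical frame mixing numeric columns with a string column.
toy = pd.DataFrame({
    "age": [20.0, 25.0, 56.0],
    "country": ["Eurasia", "Oceana", "Eastasia"],
    "y": [1, 2, 2],
})

# Names of the columns for which a mean is defined.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)  # ['age', 'y']

# numeric_only=True skips non-numeric columns such as `country`.
means = toy.mean(numeric_only=True)
print(means)
```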


In [8]:
# Question 7
free_data.country[:5]

0     Eurasia
1      Oceana
2    Eastasia
3    Eastasia
4      Oceana
Name: country, dtype: object

In [9]:
# Question 8
num_countries = free_data['country'].unique().size
print("Number of unique countries = {}".format(num_countries))

Number of unique countries = 3
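
The same count is available through `Series.nunique()`. The two approaches differ when values are missing: `unique()` lists a missing value as an entry, while `nunique()` ignores it by default. A small sketch, using a hypothetical series with one missing entry:

```python
import pandas as pd

# Hypothetical country column with one missing entry.
countries = pd.Series(["Eurasia", "Oceana", "Eastasia", "Eastasia", None])

print(countries.unique().size)          # 4 (the missing value is counted)
print(countries.nunique())              # 3 (missing values are skipped)
print(countries.nunique(dropna=False))  # 4
```

The `country` column in `free_data` has no missing entries (its count is 900), so both approaches give 3 there.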


In [10]:
# Question 9
print("Counts per country:")
free_data['country'].value_counts()

Counts per country:


Eastasia    300
Eurasia     300
Oceana      300
Name: country, dtype: int64

In [11]:
# Question 10
free_data.describe()

,id,sex,age,educ,y,v1,v2,v3,v4,v5,v6
count,900.0,898.0,892.0,890.0,900.0,900.0,900.0,900.0,900.0,900.0,900.0
mean,90665.368889,0.556793,40.744395,2.941573,3.52,2.648889,2.535556,3.664444,4.084444,3.866667,4.38
std,44234.598996,0.497041,16.743316,1.600394,1.293709,1.151991,1.26731,1.016363,0.955973,0.984869,0.989399
min,142.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,52621.0,0.0,27.0,1.0,3.0,2.0,2.0,3.0,3.0,3.0,4.0
50%,108699.0,1.0,39.0,3.0,4.0,3.0,2.0,4.0,4.0,4.0,5.0
75%,119329.0,1.0,52.0,4.0,5.0,3.0,3.0,4.0,5.0,5.0,5.0
max,171811.0,1.0,90.0,7.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0


In [12]:
# Question 11
free_data.describe(include = 'all')

,id,sex,age,educ,country,y,v1,v2,v3,v4,v5,v6
count,900.0,898.0,892.0,890.0,900,900.0,900.0,900.0,900.0,900.0,900.0,900.0
unique,,,,,3,,,,,,,
top,,,,,Eastasia,,,,,,,
freq,,,,,300,,,,,,,
mean,90665.368889,0.556793,40.744395,2.941573,,3.52,2.648889,2.535556,3.664444,4.084444,3.866667,4.38
std,44234.598996,0.497041,16.743316,1.600394,,1.293709,1.151991,1.26731,1.016363,0.955973,0.984869,0.989399
min,142.0,0.0,1.0,1.0,,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,52621.0,0.0,27.0,1.0,,3.0,2.0,2.0,3.0,3.0,3.0,4.0
50%,108699.0,1.0,39.0,3.0,,4.0,3.0,2.0,4.0,4.0,4.0,5.0
75%,119329.0,1.0,52.0,4.0,,5.0,3.0,3.0,4.0,5.0,5.0,5.0
max,171811.0,1.0,90.0,7.0,,5.0,5.0,5.0,5.0,5.0,5.0,5.0


In [13]:
# Question 12
free1.iloc[100:111]

,Unnamed: 0,sex,age,educ,country,y,v1,v2,v3,v4,v5,v6
100,71010,1.0,51.0,1.0,Eastasia,5,5,5,5,5,5,5
101,145298,1.0,20.0,2.0,Eastasia,2,2,1,3,4,5,3
102,162131,1.0,43.0,5.0,Eastasia,3,3,3,3,3,3,3
103,81406,0.0,45.0,1.0,Eastasia,1,1,1,1,1,1,1
104,164869,1.0,61.0,6.0,Eastasia,3,3,2,4,5,5,5
105,110303,0.0,37.0,6.0,Eurasia,3,3,2,3,5,3,3
106,78048,1.0,28.0,1.0,Oceana,2,2,2,4,5,4,5
107,118281,1.0,23.0,3.0,Eurasia,3,3,4,3,3,3,3
108,24024,1.0,75.0,3.0,Oceana,2,2,2,3,4,3,5
109,101158,1.0,25.0,3.0,Eastasia,2,5,4,3,4,5,2


In [14]:
# Question 13
free1.iloc[100:111, :3]

,Unnamed: 0,sex,age
100,71010,1.0,51.0
101,145298,1.0,20.0
102,162131,1.0,43.0
103,81406,0.0,45.0
104,164869,1.0,61.0
105,110303,0.0,37.0
106,78048,1.0,28.0
107,118281,1.0,23.0
108,24024,1.0,75.0
109,101158,1.0,25.0
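
Position-based slices with `iloc` exclude the end point, which is why the solutions above write `100:111` to reach row 110. Label-based slices with `loc` include both end points. The sketch below shows the difference on a hypothetical ten-row frame (`toy` is invented for illustration):

```python
import pandas as pd

# Hypothetical frame whose row labels happen to equal the row positions.
toy = pd.DataFrame({"a": range(10), "b": range(10, 20), "c": range(20, 30)})

by_position = toy.iloc[2:5]  # positions 2, 3, 4 (end point excluded)
by_label = toy.loc[2:5]      # labels 2, 3, 4, 5 (end point included)

print(len(by_position))  # 3
print(len(by_label))     # 4

# Rows and columns can both be sliced by position in a single call.
first_two = toy.iloc[2:5, :2]
print(list(first_two.columns))  # ['a', 'b']
```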


In [15]:
# Question 14
# Collect a (mean, value counts) pair for every column except `country`
results = []
for column in free_data.drop('country', axis = 1).columns:
    results.append((free_data[column].mean(), free_data[column].value_counts()))
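
The question asks for a list, but a plain list loses track of which entry belongs to which column. One possible variation, sketched below, keys the results by column name instead. The frame `toy` and the name `summary` are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical miniature of free_data.
toy = pd.DataFrame({
    "country": ["Eurasia", "Oceana", "Oceana"],
    "v1": [4, 3, 3],
    "v2": [3, 3, 5],
})

# Map each remaining column name to its (mean, value counts) pair.
summary = {
    name: (col.mean(), col.value_counts())
    for name, col in toy.drop(columns="country").items()
}

mean_v1, counts_v1 = summary["v1"]
print(mean_v1)       # mean of [4, 3, 3]
print(counts_v1[3])  # 2 (the value 3 occurs twice in v1)
```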

In [16]:
# Question 15
demographics = free1[["sex", "age", "educ"]]
print("example from demographics:")
print(demographics.head())

scores = free1.loc[:,"v1":"v6"]
print("\n" + "example from scores:")
print(scores.head())

example from demographics:
   sex   age  educ
0  0.0  20.0   4.0
1  1.0  25.0   4.0
2  1.0  56.0   2.0
3  0.0  65.0   6.0
4  1.0  50.0   5.0

example from scores:
   v1  v2  v3  v4  v5  v6
0   4   3   3   5   3   4
1   3   3   5   5   5   5
2   3   2   4   5   5   4
3   3   3   5   5   5   5
4   5   3   5   5   3   5


In [17]:
# Question 16
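# Track the largest score in each row and the column it came from
max_value = []
max_index = []
for row in range(len(scores)):
    max_value.append(scores.iloc[row].max())     # largest score in this row
    max_index.append(scores.iloc[row].idxmax())  # first column holding that score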
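
The exercise asks for an explicit loop, but the same two lists can be built without one by applying `max` and `idxmax` across columns with `axis=1`. This is a sketch on a hypothetical four-column frame, not a replacement for the requested solution. Note that `idxmax` reports the first column when several columns tie for the maximum.

```python
import pandas as pd

# Hypothetical scores; rows 1 and 2 contain ties for the maximum.
toy_scores = pd.DataFrame({
    "v1": [4, 3, 5],
    "v2": [3, 3, 3],
    "v3": [3, 5, 5],
    "v4": [5, 5, 5],
})

# axis=1 computes the result across the columns of each row.
max_value = toy_scores.max(axis=1).tolist()
max_index = toy_scores.idxmax(axis=1).tolist()

print(max_value)  # [5, 5, 5]
print(max_index)  # ['v4', 'v3', 'v1']
```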

In [18]:
# Question 17
max_scores = pd.DataFrame({
    "cat": max_index,
    "score": max_value,
})
max_scores.head()

,cat,score
0,v4,5
1,v3,5
2,v4,5
3,v3,5
4,v1,5


In [19]:
# Question 18
max_scores.cat.value_counts()

v6    102
v4     99
v3     91
v1     84
v2     46
v5     28
Name: cat, dtype: int64
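
If proportions are more useful than raw counts, `value_counts` accepts `normalize=True`. A minimal sketch on a hypothetical series (the values in `cat` below are invented, not taken from `max_scores`):

```python
import pandas as pd

# Hypothetical winners for four rows.
cat = pd.Series(["v4", "v3", "v4", "v6"])

# normalize=True returns each value's share of the total instead of its count.
shares = cat.value_counts(normalize=True)

print(shares["v4"])  # 0.5
print(shares["v3"])  # 0.25
```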