# Lesson 3 Class Exercises: Pandas Part 1
With these class exercises we learn a few new things.  When new knowledge is introduced you'll see the icon shown on the right: 
<span style="float:right; margin-left:10px; clear:both;">![Task](https://github.com/spficklin/Data-Analytics-With-Python/blob/master/media/new_knowledge.png?raw=true)</span>

## Reminder
The first check-in of the project is due next Tuesday.  After today, you should know everything you need to accomplish that first part. 

## Get Started
Import the NumPy and Pandas packages.

In [2]:
import numpy as np
import pandas as pd

## Exercise 1: Import Iris Data
Import the Iris dataset made available to you in the last class period for the Numpy Part 2 exercises. Save it to a variable named `iris`. Print the first 5 rows and the dimensions to ensure it was read in properly.

In [None]:
iris = pd.read_csv('data/iris.csv')
print(iris.head())
iris.shape

Notice how much easier this was to import compared to the Numpy `genfromtxt`. We did not have to skip the headers, we did not have to specify the data type and we can have mixed data types in the same matrix.
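
To make the comparison concrete, here is a minimal sketch of `read_csv` inferring column names and per-column types. It reads from an in-memory string rather than `data/iris.csv`, and the three rows and the measurement column names are hypothetical stand-ins for the real file.

```python
import io
import pandas as pd

# Hypothetical three-row sample laid out like an iris CSV file.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

# read_csv takes the first line as column names and infers a type per column,
# so numeric and text columns can live in the same DataFrame.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.dtypes)
print(sample.shape)
```

The numeric columns come back as floats while `species` stays as text, with no `skip_header` or `dtype` argument needed.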

## Exercise 2: Import Legislators Data
For portions of this notebook we will use a public dataset that contains all of the current legislators of the United States Congress. This dataset can be found [here](https://github.com/unitedstates/congress-legislators).  

Import the data directly from this URL:  https://theunitedstates.io/congress-legislators/legislators-current.csv

Save the data in a variable named `legislators`. Print the first 5 lines, and the dimensions.

In [None]:
legislators = pd.read_csv("https://theunitedstates.io/congress-legislators/legislators-current.csv")
print(legislators.head())
legislators.shape

## Exercise 3: Explore the Data
### Task 1
Print the column names of the legislators dataframe and explore the type of data in the dataframe. 

In [None]:
legislators.columns

### Task 2
Show the datatypes of all of the columns in the legislator data. Do all of the data types seem appropriate for the data? 

In [None]:
legislators.dtypes

Show all of the datatypes in the iris dataframe.

In [None]:
iris.dtypes

### Task 3
It's always important to know where the missing values are in your data. Are there any missing values in the legislators dataframe? How many per column?  

Hint: we didn't learn how to find missing values in the lesson, but we can use the `isna()` function.

In [None]:
legislators.isna().sum()

How about in the iris dataframe?

In [None]:
iris.isna().sum()
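
As a small illustration of why `isna().sum()` gives a per-column count, the sketch below uses a hypothetical toy dataframe (not the legislators or iris data). `isna()` returns a dataframe of booleans, and summing treats `True` as 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; both np.nan and None count as missing.
toy = pd.DataFrame({
    "name": ["Ann", "Bob", None, "Dee"],
    "score": [1.5, np.nan, np.nan, 4.0],
    "state": ["WA", "OR", "CA", "ID"],
})

missing_per_column = toy.isna().sum()            # one count per column
total_missing = int(missing_per_column.sum())    # count over the whole frame
rows_with_missing = toy[toy.isna().any(axis=1)]  # rows with at least one gap

print(missing_per_column)
print(total_missing)
print(rows_with_missing)
```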

### Task 4
It is also important to know whether you have any duplicated rows. If you are performing statistical analyses and you have duplicated entries, they can affect the results.  So, let's find out.  Are there any duplicated rows in the legislators dataframe?  Print the number of duplicates. If there are duplicates, print the rows. What function could we use to find out if we have duplicated rows?

In [None]:
legislators.duplicated().sum()

Do we have duplicated rows in the iris dataset? Print the number of duplicates. If there are duplicates, print the rows.

In [None]:
iris.duplicated().sum()

In [None]:
iris[iris.duplicated()]

If there are duplicated rows should we remove them or keep them?
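
The sketch below, on a hypothetical toy dataframe, shows the tools you would use either way. By default `duplicated()` flags only the later copies, `keep=False` flags every row involved, and `drop_duplicates()` keeps the first occurrence. Whether dropping is appropriate depends on the data: two flowers can legitimately share identical measurements, whereas the same legislator listed twice is probably an error.

```python
import pandas as pd

# Hypothetical toy data with one repeated row (rows 0 and 1 are identical).
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "setosa"],
    "petal_length": [1.4, 1.4, 6.0, 1.5],
})

n_dupes = int(toy.duplicated().sum())         # counts later copies only
all_copies = toy[toy.duplicated(keep=False)]  # every row that has a twin
deduped = toy.drop_duplicates()               # keeps the first occurrence

print(n_dupes)
print(all_copies)
print(deduped.shape)
```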

### Task 5
It is also important to check that the range of values in our data matches expectations.  For example, if we expect to have four species in our iris data, we should check that we see four species. How many political parties should we expect in the legislators data?  If all we saw was a single party, perhaps the data is incomplete. Let's check.   You can find out how many unique values there are per column using the `nunique` function.  Try it for both the legislators and the iris dataset.

In [None]:
legislators.nunique()

In [None]:
iris.nunique()

What do you think?  Do we see what we might expect?  Are there fields where this type of check doesn't matter? In what fields might this type of exploration matter?

Check to see if you have all of the values expected for a given field. Pick a column you know should have a set number of values and print all of the unique values in that column. Do so for both the legislator and iris datasets.

In [None]:
print(legislators['gender'].unique())

In [None]:
print(iris['species'].unique())
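
One way to turn this check into something mechanical is to compare the observed values against the set you expect. The following is a sketch on a hypothetical toy column; the expected set of parties is an assumption made for illustration, and `value_counts()` is an extra method not covered in the lesson.

```python
import pandas as pd

# Hypothetical toy column with one inconsistently capitalized entry.
toy = pd.DataFrame({
    "party": ["Democrat", "Republican", "Democrat", "Independent", "republican"],
})

expected = {"Democrat", "Republican", "Independent"}  # assumed for this sketch
observed = set(toy["party"].unique())

unexpected = observed - expected  # values we did not anticipate
absent = expected - observed      # values we anticipated but never saw
counts = toy["party"].value_counts()  # how often each value occurs

print(unexpected)
print(absent)
print(counts)
```

A value such as `republican` would be counted as its own category by `nunique`, which is exactly the kind of problem this check is meant to surface.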

## Exercise 5: Describe the data
For both the legislators and the iris data, get descriptive statistics for each numeric field.

In [None]:
iris.describe()

In [None]:
legislators.describe()

## Exercise 6: Row Index Labels
For the legislator dataframe, let's change the row labels from numerical indexes to something more recognizable.  Take a look at the columns of data. Is there anything you might want to substitute as a row label?  Pick one and set the index labels. Then print the top 5 rows to see if the index labels are present.

In [None]:
legislators.index = legislators['full_name']
print(legislators.head(5))
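
Assigning a column to `.index` keeps that column in the dataframe as well. As a point of comparison, the sketch below uses a hypothetical two-row dataframe to contrast that with `set_index`, which by default moves the column into the index.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "full_name": ["Ann Lee", "Bob Ray"],
    "state": ["WA", "OR"],
})

# Approach used above: the column is copied into the index and also kept.
by_assignment = toy.copy()
by_assignment.index = by_assignment["full_name"]

# Alternative: set_index returns a new frame and drops the column by default.
by_set_index = toy.set_index("full_name")

print(by_assignment.columns.tolist())
print(by_set_index.columns.tolist())
print(by_set_index.loc["Ann Lee", "state"])
```

Keeping the column is convenient in these exercises because later steps still refer to it by name.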

## Exercise 7: Indexing & Sampling
Randomly select 15 Republicans or Democrats (your choice) from the senate.

In [None]:
legislators[(~legislators['senate_class'].isna()) & (legislators['party'] == "Republican")].sample(15)
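
The one-liner above can be easier to read when the two conditions are named first. This sketch uses a hypothetical eight-row dataframe and assumes, as the solution does, that a missing `senate_class` means the member is not a senator. Passing `random_state` makes the draw repeatable.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: NaN in senate_class marks a non-senator.
toy = pd.DataFrame({
    "full_name": [f"Member {i}" for i in range(8)],
    "party": ["Republican", "Democrat"] * 4,
    "senate_class": [1.0, 2.0, np.nan, 3.0, 1.0, np.nan, 2.0, np.nan],
})

is_senator = toy["senate_class"].notna()   # same as ~isna()
is_republican = toy["party"] == "Republican"

pool = toy[is_senator & is_republican]     # rows meeting both conditions
picked = pool.sample(n=2, random_state=0)  # repeatable random draw

print(len(pool))
print(picked)
```

Note that `sample` raises an error if `n` is larger than the number of rows in the pool, unless `replace=True` is passed.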

## Exercise 8: Dates
<span style="float:right; margin-left:10px; clear:both;">![Task](https://github.com/spficklin/Data-Analytics-With-Python/blob/master/media/new_knowledge.png?raw=true)</span>

Let's learn something not covered in the Pandas 1 lesson regarding dates.  We have the birthdates for each legislator, but they are in a String format.  Let's convert it to a datetime object. We can do this using the `pd.to_datetime` function.  Take a look at the online documentation to see how to use this function. Convert the `legislators['birthday']` column to a `datetime` object. Confirm that the column is now a datetime object.

In [None]:
legislators['birthday'] = pd.to_datetime(legislators['birthday'])

In [None]:
legislators['birthday'].head()

Now that we have the birthdays in a `datetime` object, how can we calculate their age?  Hint: we can use the `pd.Timestamp.now()` function to get a datetime object for this moment. Let's subtract their birthdays from the current time.  Print the top 5 results.

In [None]:
(pd.Timestamp.now() - legislators['birthday']).head()

Notice that the result of subtracting two `datetime` objects is a `timedelta` object. It contains the difference between two time values. The value we calculated therefore gives us the number of days old.  However, we want the number of years. 

To get the number of years we can divide the number of days old by the number of days in a year (roughly 365, which ignores leap days). However, we first need to extract the days from the `timedelta` object. To do this, the Pandas Series object has an accessor for extracting components of `datetime` objects and `timedelta` objects. It's named `dt` and it works for both.  You can learn more about the attributes of this accessor at the [datetime objects page](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#datetime-properties) and the [timedelta objects page](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#timedelta-properties).  Take a moment to look over that documentation.

How would you then extract the days in order to divide by 365 to get the years?  Once you've figured it out, do so, convert the years to an integer and add the resulting series back into the legislator dataframe as a new column named `age`.  Hint: use the [astype](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.astype.html) function of Pandas to convert the type.

In [None]:
age = ((pd.Timestamp.now() - legislators['birthday']).dt.days / 365).astype('int')
legislators['age'] = age
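
To see each step of that calculation in isolation, here is a sketch with two hypothetical birthdays and a fixed reference date in place of `pd.Timestamp.now()`, so the result does not change from day to day. It also shows that dividing by 365 can overshoot by a year close to a birthday, because leap days accumulate; dividing by 365.25 is a common approximation that reduces, but does not eliminate, that drift.

```python
import pandas as pd

# Hypothetical birthdays and a fixed stand-in for "now".
birthdays = pd.to_datetime(pd.Series(["1950-01-10", "1989-12-31"]))
reference = pd.Timestamp("2020-01-01")

elapsed = reference - birthdays  # Series of timedelta values
days_old = elapsed.dt.days       # whole days as integers

age_365 = (days_old / 365).astype("int")       # method used in the exercise
age_36525 = (days_old / 365.25).astype("int")  # approximate leap-year fix

print(days_old.tolist())
print(age_365.tolist())
print(age_36525.tolist())
```

The first person has not yet turned 70 on the reference date, yet the division by 365 reports 70.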

Next, find the youngest, oldest and average age of all legislators

In [None]:
legislators.describe()

Who are the oldest and youngest legislators?

In [None]:
oldest, youngest = legislators['age'].max(), legislators['age'].min()
legislators[(legislators['age'] == oldest) | (legislators['age'] == youngest)]

## Exercise 9:  Indexing with loc and iloc
Reindex the legislators dataframe using the state, and find all legislators from your home state using the `loc` accessor.

In [None]:
legislators.index = legislators['state']
legislators.loc['SC']

Use the `loc` accessor to find all legislators from South Carolina and North Carolina.

In [None]:
legislators.loc[['SC', 'NC']]

Use the `loc` accessor to retrieve all legislators from California, Oregon and Washington and only get their full name, state, party and age.

In [None]:
legislators.loc[['CA', 'OR', 'WA'], ['full_name', 'state', 'party', 'age']]
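
The pattern is `df.loc[row_labels, column_labels]`. The sketch below, on a hypothetical four-row dataframe, shows that when index labels repeat, `loc` returns every matching row, and that rows come back in the order the labels were requested.

```python
import pandas as pd

# Hypothetical toy data indexed by a label that repeats.
toy = pd.DataFrame({
    "full_name": ["Ann Lee", "Bob Ray", "Cy Orr", "Dee Poe"],
    "state": ["WA", "OR", "WA", "CA"],
    "age": [52, 47, 61, 39],
})
toy.index = toy["state"]

one_state = toy.loc["WA"]                              # all rows labelled WA
several = toy.loc[["WA", "CA"], ["full_name", "age"]]  # rows, then columns

print(one_state)
print(several)
```

Requesting a label that is not in the index raises a `KeyError`, so it is worth checking `toy.index.unique()` first if you are unsure.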

## Exercise 10: Economics Data Example
### Task 1: Explore the data
Import the data from the [Lectures in Quantitative Economics](https://github.com/QuantEcon/lecture-source-py) regarding minimum wages in countries around the world in US Dollars.  You can view the data [here](https://github.com/QuantEcon/lecture-source-py/blob/master/source/_static/lecture_specific/pandas_panel/realwage.csv) and you can access the data file here: https://raw.githubusercontent.com/QuantEcon/lecture-source-py/master/source/_static/lecture_specific/pandas_panel/realwage.csv.  Then perform the following tasks.

Import and print the first 5 lines of data to explore what is there.

In [3]:
minwages = pd.read_csv('https://raw.githubusercontent.com/QuantEcon/lecture-source-py/master/source/_static/lecture_specific/pandas_panel/realwage.csv')

In [4]:
minwages.head()

|   | Unnamed: 0 | Time | Country | Series | Pay period | value |
|---|---|---|---|---|---|---|
| 0 | 0 | 2006-01-01 | Ireland | In 2015 constant prices at 2015 USD PPPs | Annual | 17132.443 |
| 1 | 1 | 2007-01-01 | Ireland | In 2015 constant prices at 2015 USD PPPs | Annual | 18100.918 |
| 2 | 2 | 2008-01-01 | Ireland | In 2015 constant prices at 2015 USD PPPs | Annual | 17747.406 |
| 3 | 3 | 2009-01-01 | Ireland | In 2015 constant prices at 2015 USD PPPs | Annual | 18580.139 |
| 4 | 4 | 2010-01-01 | Ireland | In 2015 constant prices at 2015 USD PPPs | Annual | 18755.832 |


Find the shape of the data.

In [5]:
minwages.shape

(1408, 6)

List the column names.

In [6]:
minwages.columns

Index(['Unnamed: 0', 'Time', 'Country', 'Series', 'Pay period', 'value'], dtype='object')

Identify the data types. Do they match what you would expect?

In [7]:
minwages.dtypes

Unnamed: 0      int64
Time           object
Country        object
Series         object
Pay period     object
value         float64
dtype: object

Identify columns with missing values. 

In [8]:
minwages.isna().sum()

Unnamed: 0     0
Time           0
Country        0
Series         0
Pay period     0
value         68
dtype: int64

Identify if there are duplicated entries.

In [9]:
minwages.duplicated().sum()

0

How many unique values are there per column?  Do these look reasonable for the data type and what you know about what is stored in the column?

In [10]:
minwages.nunique()

Unnamed: 0    1408
Time            11
Country         32
Series           2
Pay period       2
value         1289
dtype: int64

### Task 2: Explore More
Retrieve descriptive statistics for the data.

In [11]:
minwages.describe()

|   | Unnamed: 0 | value |
|---|---|---|
| count | 1408.0 | 1340.0 |
| mean | 703.5 | 5697.843084 |
| std | 406.598901 | 7475.920784 |
| min | 0.0 | 0.234 |
| 25% | 351.75 | 4.388742 |
| 50% | 703.5 | 290.606495 |
| 75% | 1055.25 | 10501.7305 |
| max | 1407.0 | 25713.797 |


Identify all of the countries listed in the data.

In [12]:
minwages['Country'].unique()

array(['Ireland', 'Spain', 'Australia', 'Turkey', 'Luxembourg',
       'New Zealand', 'United Kingdom', 'Mexico', 'Greece',
       'Slovak Republic', 'Portugal', 'France', 'United States', 'Japan',
       'Netherlands', 'Estonia', 'Hungary', 'Poland', 'Czech Republic',
       'Canada', 'Korea', 'Slovenia', 'Chile', 'Israel', 'Belgium',
       'Germany', 'Brazil', 'Russian Federation', 'Lithuania', 'Latvia',
       'Colombia', 'Costa Rica'], dtype=object)

Convert the time column to a datetime object.

In [13]:
minwages['Time'] = pd.to_datetime(minwages['Time'])

Identify the time points that were used for data collection. How many years of data collection were there? What time of year were the data collected?

In [14]:
minwages['Time'].unique()

array(['2006-01-01T00:00:00.000000000', '2007-01-01T00:00:00.000000000',
       '2008-01-01T00:00:00.000000000', '2009-01-01T00:00:00.000000000',
       '2010-01-01T00:00:00.000000000', '2011-01-01T00:00:00.000000000',
       '2012-01-01T00:00:00.000000000', '2013-01-01T00:00:00.000000000',
       '2014-01-01T00:00:00.000000000', '2015-01-01T00:00:00.000000000',
       '2016-01-01T00:00:00.000000000'], dtype='datetime64[ns]')

Because we only have one data point collected per year per country, simplify this by adding a new column with just the year.  Print the first 5 rows to confirm the column was added.

In [None]:
minwages['Year'] = minwages['Time'].dt.year
minwages.head()
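
The `dt` accessor exposes each component of a datetime column as a plain integer Series. A minimal sketch, using three of the time points listed above:

```python
import pandas as pd

# Three of the collection dates from the Time column.
times = pd.to_datetime(pd.Series(["2006-01-01", "2007-01-01", "2016-01-01"]))

years = times.dt.year    # integer year per row
months = times.dt.month  # integer month per row (1 = January)

print(years.tolist())
print(months.tolist())
```

Checking `months` is one quick way to answer the earlier question about what time of year the data were collected.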

There are two pay periods.  Retrieve them in a list of just the two strings.

In [None]:
minwages['Pay period'].unique()

### Task 3: Clean the data
We have no duplicates in this data so we do not need to consider removing those, but we do have missing values in the `value` column. Let's remove those.  Check the dimensions afterwards to make sure the rows with missing values are gone.

In [None]:
minwages.dropna(inplace=True)
minwages.shape
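
Calling `dropna()` with no arguments removes a row if any column is missing. Here that is equivalent to checking `value` alone, since it is the only column with gaps, but the `subset` argument makes the intent explicit. The sketch below uses a hypothetical toy dataframe to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two different columns.
toy = pd.DataFrame({
    "Country": ["Ireland", "Spain", "Chile", None],
    "value": [17132.443, np.nan, 3.5, 8.0],
})

any_missing_dropped = toy.dropna()              # drops rows 1 and 3
value_only = toy.dropna(subset=["value"])       # drops row 1 only

print(toy.shape)
print(any_missing_dropped.shape)
print(value_only.shape)
```

Both calls return a new dataframe; the original is only changed when `inplace=True` is passed, as in the cell above.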

### Task 4:  Indexing
Use boolean indexing to retrieve the rows of annual salary in the United States.

In [None]:
minwages[(minwages['Country'] == "United States") & (minwages['Pay period'] == 'Annual')]

Do we have enough data to calculate descriptive statistics for annual salary in the United States in 2016?

In [None]:
minwages[(minwages['Country'] == "United States") & (minwages['Pay period'] == 'Annual') & (minwages['Year'] == 2016)]

Use loc to calculate descriptive statistics for the hourly salary in the United States and then again separately for Ireland. Hint: you will have to set row indexes.

In [None]:
minwages.index = minwages['Country']

In [None]:
minwages[minwages['Pay period'] == 'Hourly'].loc['United States'].describe()

In [None]:
minwages[minwages['Pay period'] == 'Hourly'].loc['Ireland'].describe()

Now do the same for annual salary.

In [None]:
minwages[minwages['Pay period'] == 'Annual'].loc['Ireland'].describe()

In [None]:
minwages[minwages['Pay period'] == 'Annual'].loc['United States'].describe()
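
Calling `describe()` on the whole filtered dataframe also summarizes columns such as `Unnamed: 0`, which are not salaries. A possible refinement, sketched here on a hypothetical toy dataframe with made-up values, is to pass a column label to `loc` so that only `value` is described.

```python
import pandas as pd

# Hypothetical toy data in the same layout as minwages; values are made up.
toy = pd.DataFrame({
    "Country": ["Ireland"] * 4 + ["Spain"] * 2,
    "Pay period": ["Annual", "Annual", "Hourly", "Hourly", "Annual", "Hourly"],
    "value": [17000.0, 18000.0, 8.0, 9.0, 12000.0, 5.0],
})
toy.index = toy["Country"]

annual = toy[toy["Pay period"] == "Annual"]     # boolean filter first
ireland_annual = annual.loc["Ireland", "value"]  # row label, column label
summary = ireland_annual.describe()

print(summary)
```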