#Python Basics and Exploratory Data Analysis with Pandas

##Procedural Statements
Procedural statements are literal statements that can be issued one line at a time. Below are types of procedural statements. These statements can be run in:

* Jupyter Notebook
* IPython shell
* Python interpreter
* Python scripts



### Printing

In [None]:
print("Hello world!")

Hello world!


###Creating a variable

In [None]:
string_first_name = "Eric"
print(string_first_name)

Eric


###Multiple procedural statements

In [None]:
string_last_name = "Bloomfield"
string_full_name = string_first_name + " " + string_last_name
print(string_full_name)

Eric Bloomfield


###F-Strings

In [None]:
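#f-string: the expression inside the braces is evaluated in place
print(f"{string_full_name} has the coolest python demonstration")
#Older %-style formatting gives the same result and does not need the f prefix
print("%s has the coolest python demonstration" % string_full_name)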

Eric Bloomfield has the coolest python demonstration
Eric Bloomfield has the coolest python demonstration
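Python offers three common ways to build the same string. The sketch below puts them side by side; the variable `name` is only an illustrative stand-in for `string_full_name`.

```python
name = "Eric Bloomfield"  # stand-in for string_full_name

# f-string (Python 3.6+): expressions go directly inside the braces
with_fstring = f"{name} has the coolest python demonstration"

# str.format: placeholders are filled from the arguments
with_format = "{} has the coolest python demonstration".format(name)

# %-formatting: the older printf-style operator
with_percent = "%s has the coolest python demonstration" % name

# All three approaches produce the same text
print(with_fstring == with_format == with_percent)
```

f-strings are usually the easiest to read, which is why the rest of this notebook uses them.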


##Different data structures
Python can handle many different data structures, such as lists, tuples, and sets. One of the most useful and common Python data structures is the dictionary. If you think of a dictionary in real life, it contains a word and a definition for each item inside it. A Python dictionary maps a key (word) to a value (definition) and provides a simple way to store data in pairs.

In [None]:
pizzas = {"plain": 10.00, 
               "sausage": 13.50, 
               "peppers and onion": 12.00, 
               "hawaiian": 14.50}
pizzas

{'hawaiian': 14.5, 'peppers and onion': 12.0, 'plain': 10.0, 'sausage': 13.5}
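Beyond creating a dictionary, the everyday operations are looking up, adding, and safely checking for keys. Here is a minimal sketch using a shortened menu; the "mushroom" entry and its price are made up for illustration.

```python
pizzas = {"plain": 10.00, "sausage": 13.50}

# Look up a value by its key
plain_price = pizzas["plain"]

# Add a new key/value pair (or overwrite an existing one)
pizzas["mushroom"] = 12.50  # hypothetical menu item

# .get returns a default instead of raising KeyError for a missing key
anchovy_price = pizzas.get("anchovy", "not on the menu")

print(plain_price)
print(anchovy_price)
print("mushroom" in pizzas)  # membership tests check the keys
```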

###Iterating over a dictionary

In [None]:
#Topping is our key, and price is the value
for topping, price in pizzas.items():
  print(f"A {topping} pizza costs ${price:0.2f}") 
  #The :0.2f format spec displays the number with two decimal places.
  #Iterating lets us apply the same format to every item in the dictionary!

A plain pizza costs $10.00
A sausage pizza costs $13.50
A peppers and onion pizza costs $12.00
A hawaiian pizza costs $14.50


##Using functions and working with dataframes

###Functions
Functions allow you to automate repetitive calculations and help you write more efficient, easier-to-understand code.

In [None]:
def calc_conversion(binds, quotes):
  conversion = binds/quotes
  return conversion

In [None]:
print(calc_conversion(100, 2000))
print(calc_conversion(1783289, 2564378))

0.05
0.695408009271644
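One thing the function above does not handle is a group with zero quotes, which would raise a `ZeroDivisionError`. A possible guard is sketched below; the name `calc_conversion_safe` and the choice to return `None` are assumptions made for illustration, not part of the original notebook.

```python
def calc_conversion_safe(binds, quotes):
    """Return binds / quotes, or None when there are no quotes."""
    if quotes == 0:
        return None  # nothing was quoted, so conversion is undefined
    return binds / quotes


print(calc_conversion_safe(100, 2000))
print(calc_conversion_safe(5, 0))
```

Returning `None` forces the caller to decide how to report an empty group; returning 0 would be another reasonable choice, depending on the analysis.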


###Importing data with Pandas
Pandas is a common Python package for manipulating dataframes. Here, we load a .csv file from my GitHub repository as a dataframe and do some calculations with it.

In [None]:
import pandas as pd
!git clone https://github.com/the-eric-bloomfield/Python-P2P-OH.git
%cd Python-P2P-OH/
PL_Rating_Downsample = pd.read_csv('PL_Rating_Downsample.csv')
PL_Rating_Downsample.head()

Cloning into 'Python-P2P-OH'...
remote: Counting objects: 30, done.
remote: Compressing objects: 100% (29/29), done.
remote: Total 30 (delta 11), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (30/30), done.
/content/Python-P2P-OH




Unnamed: 0.1,Unnamed: 0,Rater,Owner_Master_Safeco_Agency_Cd,Primary_Safeco_Agency_Code,Rating_State_Abbr,App_Transaction_Timestamp,App_Origination_Date,Real_Time_Rater_Client_Name,First_App_Interest_ID,Last_App_Interest_ID,...,V2Usedesc,V2Year,Collision,Rental,Towing,PD,BI,RT_FLG,D1Age,Rand
0,1,PL Rating,3611136,3610132,VA,21SEP17:13:11:12,09/21/2017,SETWrite,834455705,834769588,...,Pleasure use,2007,100,?,Y,100,100/300,YES,77.0,4
1,2,PL Rating,11351560,11352170,KY,19FEB18:14:14:57,02/19/2018,SETWrite,866298509,876432266,...,Low Mileage (True Pricing),2005,1000,35S,Y,100,250/500,NO,89.0,4
2,3,PL Rating,11581838,11581838,NY,22MAY18:14:24:35,05/22/2018,SETWrite,888888408,888888410,...,Pleasure use,2013,500,50S,Y,100,100/300,YES,58.0,4
3,4,PL Rating,11581838,11581838,NY,01AUG17:14:19:04,08/01/2017,SETWrite,822401541,822401545,...,Pleasure use,2005,500,50S,Y,50,100/300,YES,35.5,4
4,5,PL Rating,11502796,11502796,PA,23FEB18:13:26:37,02/23/2018,SETWrite,867535700,867535702,...,?,?,250,50S,Y,100,250/500,NO,86.5,4


###Calculations with dataframes
Manipulating data in Python is very similar to R. Let's take a subset of our data to focus only on NY and TX.

In [None]:
df_tx_ny = PL_Rating_Downsample[(PL_Rating_Downsample['Rating_State_Abbr']=='NY') | (PL_Rating_Downsample['Rating_State_Abbr']=='TX')]
df_tx_ny[:10]

Unnamed: 0.1,Unnamed: 0,Rater,Owner_Master_Safeco_Agency_Cd,Primary_Safeco_Agency_Code,Rating_State_Abbr,App_Transaction_Timestamp,App_Origination_Date,Real_Time_Rater_Client_Name,First_App_Interest_ID,Last_App_Interest_ID,...,V2Usedesc,V2Year,Collision,Rental,Towing,PD,BI,RT_FLG,D1Age,Rand
2,3,PL Rating,11581838,11581838,NY,22MAY18:14:24:35,05/22/2018,SETWrite,888888408,888888410,...,Pleasure use,2013,500,50S,Y,100,100/300,YES,58.0,4
3,4,PL Rating,11581838,11581838,NY,01AUG17:14:19:04,08/01/2017,SETWrite,822401541,822401545,...,Pleasure use,2005,500,50S,Y,50,100/300,YES,35.5,4
8,9,PL Rating,11581087,11581087,NY,28DEC17:10:04:05,12/28/2017,SETWrite,854587737,854587739,...,?,?,500,50S,Y,100,100/300,NO,26.0,4
10,11,PL Rating,11580251,11580251,NY,20FEB18:11:16:30,02/20/2018,SETWrite,866545130,866545134,...,?,?,?,?,?,50,50/100,NO,33.0,4
14,15,PL Rating,11580476,11580476,NY,13APR18:09:15:04,04/13/2018,SETWrite,879563578,879563590,...,?,?,500,35S,Y,50,100/300,NO,37.5,4
15,16,PL Rating,6376567,6343479,TX,01MAY18:11:29:46,05/01/2018,SETWrite,883688827,883688829,...,?,?,500,35,Y,25,30/60,YES,47.5,4
17,18,PL Rating,11580120,11580120,NY,04JUN18:12:40:56,06/04/2018,SETWrite,891636466,891636470,...,Pleasure use,2004,250,50S,Y,100,100/300,NO,52.0,4
20,21,PL Rating,11581509,11581509,NY,15MAY18:17:45:12,05/15/2018,SETWrite,887204200,887204202,...,?,?,500,50S,Y,100,100/300,YES,50.0,4
22,23,PL Rating,11580693,11580693,NY,26APR18:16:30:46,04/26/2018,SETWrite,882683328,882683338,...,?,?,500,50S,Y,?,?,NO,57.5,4
23,24,PL Rating,11580671,11580671,NY,30MAR18:12:52:49,03/30/2018,SETWrite,876131282,876131284,...,Low Mileage (True Pricing),2011,200,?,?,100,250/500,NO,68.5,4


Now, let's use our function from before to calculate the conversion for each state:

In [None]:
ny_conversion = calc_conversion(df_tx_ny.loc[df_tx_ny.Rating_State_Abbr == "NY", "Issue_Cnt"].sum(), df_tx_ny.loc[df_tx_ny.Rating_State_Abbr == "NY", "Quote_Cnt"].sum())
tx_conversion = calc_conversion(df_tx_ny.loc[df_tx_ny.Rating_State_Abbr == "TX","Issue_Cnt"].sum(),  df_tx_ny.loc[df_tx_ny.Rating_State_Abbr == "TX", "Quote_Cnt"].sum())
conversions = {"NY":ny_conversion, 
                "TX":tx_conversion}

for state, conversion in conversions.items():
  print(f"{state}'s conversion is {conversion:.2%}")

NY's conversion is 2.25%
TX's conversion is 4.24%


There are several interesting tools being employed here:

1. ".loc" is a pandas indexer that lets us subset our data by label or by condition. Here, it lets us say "identify the rows where the state column equals NY (or TX), then sum the issues (or quotes) column".
2. Those conditional sums are then passed to our earlier function, calc_conversion, which divides issues by quotes.
3. We store the conversions in a dictionary, where the key is the state and the value is the conversion.
4. We then iterate over that dictionary and print each state's conversion formatted as a percentage.
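The same steps can also be written with "isin" and "groupby", which avoids repeating the ".loc" expression for every state. This is only a sketch of an alternative: the numbers in `toy` are made up and are not taken from PL_Rating_Downsample, although the column names mirror the ones used above.

```python
import pandas as pd

# Made-up numbers for illustration only
toy = pd.DataFrame({
    "Rating_State_Abbr": ["NY", "NY", "TX", "TX", "VA"],
    "Issue_Cnt": [1, 0, 2, 1, 1],
    "Quote_Cnt": [10, 10, 20, 10, 5],
})

# Keep only the two states of interest
subset = toy[toy["Rating_State_Abbr"].isin(["NY", "TX"])]

# Sum issues and quotes per state, then divide one column by the other
totals = subset.groupby("Rating_State_Abbr")[["Issue_Cnt", "Quote_Cnt"]].sum()
toy_conversions = (totals["Issue_Cnt"] / totals["Quote_Cnt"]).to_dict()

for state, conversion in toy_conversions.items():
    print(f"{state}'s conversion is {conversion:.2%}")
```

With many states, this scales better than writing one ".loc" line per state, because the grouping handles every state in a single pass.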



#Modeling using a decision tree classifier
One of the most powerful features of Python is the ability to easily deploy advanced statistical models used in data science. Using a decision tree allows us to gain easily interpretable insights into our data and identify the most important independent variables (features) for making predictions.

##Decision Trees Overview
A decision tree is a decision support model with a flowchart-like structure that weighs choices against their possible outcomes and associated costs. A decision tree works by splitting data into subsets based on conditional rules until each subset contains (ideally) a single class. The splits are determined by an algorithm that prioritizes "information gain", which is calculated from how homogeneous the resulting subsets are.
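To make "information gain" concrete, here is a small sketch that measures homogeneity with Shannon entropy, one common choice. The helper names `entropy` and `information_gain` and the toy labels are illustrative assumptions. Note also that scikit-learn's DecisionTreeClassifier uses Gini impurity by default; entropy can be requested with criterion="entropy".

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy labels, made up for illustration
parent = ["fit", "fit", "fit", "unfit", "unfit", "unfit"]
pure_split = [["fit", "fit", "fit"], ["unfit", "unfit", "unfit"]]
mixed_split = [["fit", "unfit", "fit"], ["unfit", "fit", "unfit"]]

gain_pure = information_gain(parent, pure_split)
gain_mixed = information_gain(parent, mixed_split)
print(gain_pure, gain_mixed)
```

The split that produces perfectly homogeneous subsets earns the larger gain, so a tree-building algorithm would prefer it over the mixed split.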

![alt text](https://cdn-images-1.medium.com/max/800/1*JAEY3KP7TU2Q6HN6LasMrw.png)

This simple example begins with a set of people, then subsets based on variable conditions such as age, pizzas eaten, and exercise habits. Once we reach the end of the tree (terminal leaf nodes), each subset's members belong to one of two distinct classes: fit or unfit.

##Modeling the danger of common mushrooms
![alt text](https://raw.githubusercontent.com/the-eric-bloomfield/Python-P2P-OH/master/2018-07-24%2013_49_41-Presentation1%20-%20PowerPoint.png)

One of these mushrooms is the prized Agaricus augustus, or "Prince Mushroom", one of the finest edible species of mushroom. The other is Lepiota subincarnata, or "Deadly Parasol", a mushroom found in North American forests that can cause deadly liver damage. How can we tell which is which? By looking at a few key variables, mycologists can tell that the specimen on the left is a choice edible, while the one on the right will kill you. Notice that the agaricus has a smooth stalk with a scaly cap, while the lepiota has a fibrous stalk and smooth cap. Though not pictured, the agaricus also has chocolate brown gills, while the lepiota has white gills. To a trained expert, this identification is a piece of cake, but what about us laypeople? To identify whether a mushroom is edible or poisonous, we'll build a predictive model to aid in our foraging.

##Loading data
First, we'll load a new dataset into Pandas which contains some key identifying information for species of lepiota and agaricus mushrooms. This set was coded by the UC Irvine Center for Machine Learning from the Audubon Society field guide to mushrooms. It contains information on cap shape, cap color, spore color, veil type, habitat, rarity, and more.

In [None]:
df_shrooms = pd.read_csv('lepiota_agaricus.csv')

##Formatting data
Next, we'll convert the single letter codes in our data to their full names so the data is easier to interpret. Additionally, we'll change "Edibility" to a binary 1 and 0 indicator for use in our model. The line at the end displays the first ten rows of data.

In [None]:
#Assign each result back to its column; calling replace(..., inplace=True) on a
#selected column is chained assignment and may not update df_shrooms in newer pandas
df_shrooms['Edibility'] = df_shrooms['Edibility'].replace(['e','p'],[1,0]).astype(int)
df_shrooms['Cap_Shape'] = df_shrooms['Cap_Shape'].replace(['b','c','x','f','k','s'],['bell','conical','convex','flat','knobbed','sunken'])
df_shrooms['Cap_Surface'] = df_shrooms['Cap_Surface'].replace(['f','g','y','s'],['fibrous','grooves','scaly','smooth'])
df_shrooms['Cap_Color'] = df_shrooms['Cap_Color'].replace(['n','b','c','g','r','p','u','e','w','y'],['brown','buff','cinnamon','gray','green','pink','purple','red','white','yellow'])
df_shrooms['Bruises'] = df_shrooms['Bruises'].replace(['t','f'],['bruises','no'])
df_shrooms['Odor'] = df_shrooms['Odor'].replace(['a','l','c','y','f','m','n','p','s'],['almond','anise','creosote','fishy','foul','musty','none','pungent','spicy'])
df_shrooms['Gill_Attachment'] = df_shrooms['Gill_Attachment'].replace(['a','d','f','n'],['attached','descending','free','notched'])
df_shrooms['Gill_Spacing'] = df_shrooms['Gill_Spacing'].replace(['c','w','d'],['close','crowded','distant'])
df_shrooms['Gill_Size'] = df_shrooms['Gill_Size'].replace(['b','n'],['broad','narrow'])
df_shrooms['Gill_Color'] = df_shrooms['Gill_Color'].replace(['k','n','b','h','g','r','o','p','u','e','w','y'],['black','brown','buff','chocolate','gray','green','orange','pink','purple','red','white','yellow'])
df_shrooms['Stalk_Shape'] = df_shrooms['Stalk_Shape'].replace(['e','t'],['enlarging','tapering'])
df_shrooms['Stalk_Root'] = df_shrooms['Stalk_Root'].replace(['b','c','u','e','z','r','?'],['bulbous','club','cup','equal','rhizomorphs','rooted','missing'])
df_shrooms['Stalk_Surface_Above_Ring'] = df_shrooms['Stalk_Surface_Above_Ring'].replace(['f','y','k','s'],['fibrous','scaly','silky','smooth'])
df_shrooms['Stalk_Surface_Below_Ring'] = df_shrooms['Stalk_Surface_Below_Ring'].replace(['f','y','k','s'],['fibrous','scaly','silky','smooth'])
df_shrooms['Stalk_Color_Above_Ring'] = df_shrooms['Stalk_Color_Above_Ring'].replace(['n','b','c','g','o','p','e','w','y'],['brown','buff','cinnamon','gray','orange','pink','red','white','yellow'])
df_shrooms['Stalk_Color_Below_Ring'] = df_shrooms['Stalk_Color_Below_Ring'].replace(['n','b','c','g','o','p','e','w','y'],['brown','buff','cinnamon','gray','orange','pink','red','white','yellow'])
df_shrooms['Veil_Type'] = df_shrooms['Veil_Type'].replace(['p','u'],['partial','universal'])
df_shrooms['Veil_Color'] = df_shrooms['Veil_Color'].replace(['n','o','w','y'],['brown','orange','white','yellow'])
df_shrooms['Ring_Number'] = df_shrooms['Ring_Number'].replace(['n','o','t'],['none','one','two'])
df_shrooms['Ring_Type'] = df_shrooms['Ring_Type'].replace(['c','e','f','l','n','p','s','z'],['cobwebby','evanescent','flaring','large','none','pendant','sheathing','zone'])
df_shrooms['Spore_Print_Color'] = df_shrooms['Spore_Print_Color'].replace(['k','n','b','h','r','o','u','w','y'],['black','brown','buff','chocolate','green','orange','purple','white','yellow'])
df_shrooms['Population'] = df_shrooms['Population'].replace(['a','c','n','s','v','y'],['abundant','clustered','numerous','scattered','several','solitary'])
df_shrooms['Habitat'] = df_shrooms['Habitat'].replace(['g','l','m','p','u','w','d'],['grasses','leaves','meadows','paths','urban','waste','woods'])

df_shrooms[:10]

Unnamed: 0,Index,Edibility,Cap_Shape,Cap_Surface,Cap_Color,Bruises,Odor,Gill_Attachment,Gill_Spacing,Gill_Size,...,Stalk_Surface_Below_Ring,Stalk_Color_Above_Ring,Stalk_Color_Below_Ring,Veil_Type,Veil_Color,Ring_Number,Ring_Type,Spore_Print_Color,Population,Habitat
0,1,0,convex,smooth,brown,bruises,pungent,free,close,narrow,...,smooth,white,white,partial,white,one,pendant,black,scattered,urban
1,2,1,convex,smooth,yellow,bruises,almond,free,close,broad,...,smooth,white,white,partial,white,one,pendant,brown,numerous,grasses
2,3,1,bell,smooth,white,bruises,anise,free,close,broad,...,smooth,white,white,partial,white,one,pendant,brown,numerous,meadows
3,4,0,convex,scaly,white,bruises,pungent,free,close,narrow,...,smooth,white,white,partial,white,one,pendant,black,scattered,urban
4,5,1,convex,smooth,gray,no,none,free,crowded,broad,...,smooth,white,white,partial,white,one,evanescent,brown,abundant,grasses
5,6,1,convex,scaly,yellow,bruises,almond,free,close,broad,...,smooth,white,white,partial,white,one,pendant,black,numerous,grasses
6,7,1,bell,smooth,white,bruises,almond,free,close,broad,...,smooth,white,white,partial,white,one,pendant,black,numerous,meadows
7,8,1,bell,scaly,white,bruises,anise,free,close,broad,...,smooth,white,white,partial,white,one,pendant,brown,scattered,meadows
8,9,0,convex,scaly,white,bruises,pungent,free,close,narrow,...,smooth,white,white,partial,white,one,pendant,black,several,grasses
9,10,1,bell,smooth,yellow,bruises,almond,free,close,broad,...,smooth,white,white,partial,white,one,pendant,black,scattered,meadows


Next, we need to change our data into a numeric type so our model can make sense of it. Using the pandas function "get_dummies" we can quickly turn our data into binary 1s and 0s representing each possible value for identification. We then print the first ten rows to see the changes. Note how we went from 24 columns to 119.

In [None]:
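#dtype=int keeps the 1/0 values shown below; newer pandas versions default to True/False
df_shrooms = pd.get_dummies(df_shrooms, dtype=int)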
df_shrooms[:10]

Unnamed: 0,Index,Edibility,Cap_Shape_bell,Cap_Shape_conical,Cap_Shape_convex,Cap_Shape_flat,Cap_Shape_knobbed,Cap_Shape_sunken,Cap_Surface_fibrous,Cap_Surface_grooves,...,Population_scattered,Population_several,Population_solitary,Habitat_grasses,Habitat_leaves,Habitat_meadows,Habitat_paths,Habitat_urban,Habitat_waste,Habitat_woods
0,1,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,1,0,0
1,2,1,0,0,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,3,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,4,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,1,0,0
4,5,1,0,0,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
5,6,1,0,0,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
6,7,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
7,8,1,1,0,0,0,0,0,0,0,...,1,0,0,0,0,1,0,0,0,0
8,9,0,0,0,1,0,0,0,0,0,...,0,1,0,1,0,0,0,0,0,0
9,10,1,1,0,0,0,0,0,0,0,...,1,0,0,0,0,1,0,0,0,0
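To see what "get_dummies" does on a scale that fits on one screen, here is a minimal sketch with a tiny made-up frame (not the mushroom data). Numeric columns pass through untouched, while each text column becomes one 0/1 column per distinct value.

```python
import pandas as pd

# Tiny made-up frame for illustration
toy = pd.DataFrame({
    "Edibility": [1, 0, 1],
    "Cap_Shape": ["bell", "convex", "bell"],
    "Odor": ["almond", "foul", "none"],
})

# One new column per (original column, value) pair
encoded = pd.get_dummies(toy, dtype=int)

print(list(encoded.columns))
print(encoded)
```

Three columns become six here, which is the same effect that grows the mushroom data from 24 columns to 119.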


##Modeling our data
First, we import the tree module from the scikit-learn (sklearn) package. scikit-learn is a widely used package with various data science applications including decision trees, support vector machines, neural networks, linear regression, and more. We'll fit our model on a training set and then evaluate its accuracy on a test set. These sets are created through random sampling by the function "train_test_split".

In [None]:
from sklearn import tree 
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


In [None]:
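#Note: the leftover row-counter column 'Unnamed: 0' is still kept as a feature here
shrooms_features = df_shrooms.drop(['Edibility','Index'], axis=1)
#Take the names from the feature frame itself so they line up with the model's inputs
s_feature_names = shrooms_features.columns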
shrooms_edibility= df_shrooms['Edibility']
features_train, features_test, edibility_train, edibility_test = train_test_split(shrooms_features, shrooms_edibility, random_state=1)
clf = tree.DecisionTreeClassifier(max_depth = 3)
clf = clf.fit(features_train, edibility_train)

edibility_predict = clf.predict(features_test)
print(accuracy_score(edibility_test, edibility_predict))

0.9891678975873953


![alt text](https://raw.githubusercontent.com/the-eric-bloomfield/Python-P2P-OH/master/2018-07-24%2011_16_45-Document6%20-%20Word.png)
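A fitted tree can also be inspected in code rather than as a picture. The sketch below uses a tiny made-up frame (not the mushroom data) to show two scikit-learn tools: the "feature_importances_" attribute and the "export_text" helper. The names `toy`, `labels`, and `toy_clf` are illustrative.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny made-up, already one-hot-encoded frame for illustration
toy = pd.DataFrame({
    "Odor_foul":      [1, 1, 0, 0, 0, 1],
    "Cap_Shape_bell": [0, 1, 1, 0, 1, 0],
})
labels = [0, 0, 1, 1, 1, 0]  # 1 = edible, 0 = poisonous

toy_clf = DecisionTreeClassifier(max_depth=3, random_state=1).fit(toy, labels)

# Share of the tree's impurity reduction credited to each feature
importances = pd.Series(toy_clf.feature_importances_, index=toy.columns)
print(importances.sort_values(ascending=False))

# Plain-text view of the learned rules
print(export_text(toy_clf, feature_names=list(toy.columns)))
```

For the real model, the same calls would take `clf` and the columns of `shrooms_features` in place of the toy objects.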


##Model Results
Our accuracy score is 98.9%! This means our model built on training data classifies the testing set very well. However, decision trees have a large tendency to overfit (place too much weight on random noise in our data) and may not respond well to new data. Although results look promising, they should be evaluated further before we operationalize a model like this and decide to eat a potentially dangerous mushroom.
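One common way to evaluate a model further is k-fold cross-validation, which repeats the train/test exercise on several different splits instead of relying on a single one. The sketch below runs on synthetic data from "make_classification" so that it is self-contained; it is an illustration of the technique, not a result for the mushroom model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, generated only for this example
X, y = make_classification(n_samples=500, n_features=10, random_state=1)

cv_clf = DecisionTreeClassifier(max_depth=3, random_state=1)

# Accuracy on each of 5 held-out folds
scores = cross_val_score(cv_clf, X, y, cv=5)

print(scores)
print(scores.mean(), scores.std())
```

With the mushroom data, `X` and `y` would be `shrooms_features` and `shrooms_edibility`. A large spread between folds would be a warning sign that a single accuracy number is overstating how well the model generalizes.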

Data science algorithms are great tools for analysis and decision making, but it is up to the analyst to thoroughly vet their results and take the output with a grain of salt. It is very easy to build a model on noise and trick yourself into believing it is very predictive.

Artist and critic James Bridle built a machine learning model that would predict the results of the Brexit referendum by modeling the relationship between weather and opinion polling. This demonstrates the dangers of placing too much emphasis on these techniques. Read more about the project [here](https://creators.vice.com/en_us/article/53wx33/weather-political-machine-learning-data-art).

Nate Silver, famed economist and statistician, made a name for himself and his website [fivethirtyeight.com](https://www.fivethirtyeight.com) by successfully calling the results for 49 of 50 states in the 2008 election and every state in Barack Obama's 2012 reelection. In 2016, the major forecasters, FiveThirtyEight among them, used advanced modeling techniques to put Hillary Clinton's chances of winning the election somewhere between 70% and 99%. We all know how that turned out. Read more [here](https://www.nytimes.com/2016/11/10/technology/the-data-said-clinton-would-win-why-you-shouldnt-have-believed-it.html).

It is very easy to get models to return the results you want to see, and any individual model is only as effective as the validity of its underlying data and the integrity of the analyst. Statistical modeling is a useful tool in the analyst's belt, but it is not a magic bullet for every analytical or decision-making problem.

##Further Resources

###Python
[Kaggle](https://www.kaggle.com)

[CodeCademy Python Course](https://www.codecademy.com/learn/learn-python)

###Statistics

[An Introduction to Statistical Learning with Applications in R](https://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf)