# Python for Data Science 101

# 1. Environment Setup

<img src="images/anaconda_logo.png" style="float: none; margin: 10px 10px 0 0; width: 200px;"/>

**Navigate to:** https://www.anaconda.com/download/

**What is it?** Anaconda is a free and open-source distribution of Python that aims to simplify package management and deployment. It includes the most commonly used libraries, as well as pip, the Python package manager used to install and manage new libraries.

**Step 1**

<img src="images/image_01a.png" style="width: 800px;"/>

**Step 2**
<img src="images/image_01b.png" style="width: 800px;"/>

**Step 3**
<img src="images/image_01c.png" style="width: 800px;"/>

<img src="images/git_logo.png" style="float: none; margin: 15px 15px 0 0; width: 100px;"/>

**Navigate to:** https://git-scm.com/download

**What is it?** Git is a distributed version control system for tracking changes in source code during software development. It is designed for coordinating work among programmers, but it can be used to track changes in any set of files.

<img src="images/image_02.png" style="width: 800px;"/>

<img src="images/github_logo.png" style="float: none; margin: 15px 15px 0 0; width: 200px;"/>

**Navigate to:** https://github.com/

**What is it?** GitHub is a web-based hosting service for version control using Git. It offers all of the distributed version control and source code management functionality of Git, as well as access control and several collaboration features such as bug tracking, feature requests, task management, and wikis for every project.

<img src="images/image_03.png" style="width: 800px;"/>

<img src="images/aws_logo.png" style="float: none; margin: 15px 15px 0 0; width: 200px;"/>

**Navigate to:** https://aws.amazon.com/free/

**What is it?** Amazon Web Services (AWS) is a subsidiary of Amazon that provides on-demand cloud computing platforms to individuals, companies and governments, on a paid subscription basis. The technology allows subscribers to have at their disposal a virtual cluster of computers, available all the time, through the Internet.

<img src="images/image_04.png" style="width: 800px;"/>

<img src="images/gcp_logo.png" style="float: none; margin: 15px 15px 0 0; width: 250px;"/>

**Navigate to:** https://cloud.google.com/

**What is it?** Google Cloud Platform (GCP) is a suite of cloud computing services that runs on the same infrastructure that Google uses internally for its end-user products, such as Google Search and YouTube. Alongside a set of management tools, it provides a series of modular cloud services including computing, data storage, data analytics and machine learning.

<img src="images/image_05.png" style="width: 800px;"/>

## Required Packages
Using the command line, navigate to the directory where you want to store this training:

`cd`

After you have copied the HTTPS clone link, run the command:

`git clone https://github.com/araveret/training.git`

Navigate into your new repository and install the required packages with the command: 

`pip install -r requirements.txt`

# 2. Base Python
Using the command line, open Spyder with the command: 

`spyder`

Once Spyder opens, drag the **01_base_python.py** file into the text editor on the left side of the integrated development environment (IDE).

# 3. Data Collection
## Flat file
**Use Case:** Titanic (CSV)

In [1]:
# load the titanic data in the data folder
import pandas as pd
path = r'./data/titanic.csv'
titanic = pd.read_csv(path)

In [2]:
titanic.columns = ['survived', 'pclass', 'name', 'sex', 'age', 'sib_spouse', 'parent_child', 'fare']
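Assigning a list to `columns` renames by position, so it depends on the file's column order. As a minimal sketch of an order-independent alternative, the cell below renames with an explicit mapping. The two-row sample and its original header names are assumptions made for illustration, not the contents of `./data/titanic.csv`.

```python
import io
import pandas as pd

# Hypothetical two-row sample; the header names are assumed for illustration
sample_csv = io.StringIO(
    "Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare\n"
    "0,3,Mr. Owen Harris Braund,male,22,1,0,7.25\n"
    "1,1,Mrs. John Bradley (Florence Briggs Thayer) Cumings,female,38,1,0,71.2833\n"
)

# Map each original header to the short name used in this training
column_map = {
    'Survived': 'survived',
    'Pclass': 'pclass',
    'Name': 'name',
    'Sex': 'sex',
    'Age': 'age',
    'Siblings/Spouses Aboard': 'sib_spouse',
    'Parents/Children Aboard': 'parent_child',
    'Fare': 'fare',
}

# rename() matches on the header text, so column order does not matter
sample = pd.read_csv(sample_csv).rename(columns=column_map)
print(sample.dtypes)
```

If a header in the mapping is missing from the file, `rename` leaves the other columns alone, so it is worth checking `sample.columns` after loading.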

## Database
**Use Case:** Titanic (AWS RDS) 

In [25]:
# import libraries
import psycopg2
from io import StringIO

# create functions to gather all the information you need in a create statement
def get_length(df, column):
    return str(max([len(str(x)) for x in df[column]]))

def get_type(df, column):
    return key[str(df[column].dtypes)]
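To make the goal of these helpers concrete, here is a small sketch of the kind of `CREATE TABLE` statement they feed. It is a variation rather than the training's exact logic: the toy dataframe and the `column_definition` helper are hypothetical, and a length is attached only to `varchar` columns.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical toy frame standing in for the titanic data
toy = pd.DataFrame({'name': ['Ann', 'Robert'], 'age': [31.5, 40.0], 'pclass': [1, 3]})

def column_definition(df, column):
    """Return '<column> <sql type>' for one dataframe column."""
    series = df[column]
    if ptypes.is_integer_dtype(series):
        return f'{column} int'
    if ptypes.is_float_dtype(series):
        return f'{column} float'
    # Treat everything else as text, sized to the longest value
    width = max(len(str(value)) for value in series)
    return f'{column} varchar({width})'

create_sql = 'CREATE TABLE toy(' + ', '.join(column_definition(toy, c) for c in toy.columns) + ');'
print(create_sql)
```

Printing the statement before executing it is a cheap way to catch a type that the database will reject.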

In [29]:
# create the name of the table and a key for columns types
table = 'titanic'
key = {'object':'varchar',
       'int64':'int',
       'float64':'float',
       'datetime64[ns]':'date',
       'bool':'boolean'
       }

# build your create statement
drop = "DROP TABLE IF EXISTS "+table+" CASCADE;"
create = "CREATE TABLE "+table+"("+ \
        ', '.join([column +' '+ get_type(titanic, column) +'({})'.format(get_length(titanic, column)) \
        if (get_type(titanic, column)!='date')&(get_type(titanic, column)!='int') \
        else column +' '+ get_type(titanic, column) \
        for column in titanic.columns]) +');'
statement = ' '.join([drop, create])

# initialize a string buffer
sio = StringIO()
sio.write(titanic.to_csv(index=None, header=None))
sio.seek(0)

0

In [None]:
# see the string buffer that the last step created
sio.getvalue()
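The `seek(0)` call matters because writing leaves the buffer's position at the end, and a reader such as `copy_from` starts from the current position. The sketch below shows this with a hypothetical two-row dataframe.

```python
from io import StringIO
import pandas as pd

# Hypothetical toy frame used only to illustrate the buffer behaviour
toy = pd.DataFrame({'name': ['Ann', 'Bob'], 'age': [31.5, 40.0]})

buffer = StringIO()
toy.to_csv(buffer, index=False, header=False)

# The position is now at the end, so a plain read() returns nothing
at_end = buffer.read()

# Rewind so the next reader starts from the first row
buffer.seek(0)

# getvalue() returns the full contents regardless of the position
lines = buffer.getvalue().splitlines()
print(lines)
```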

**Create a table and copy data into it**

In [None]:
# create a connection with the database
conn = psycopg2.connect(' '.join(['host=HOST_HERE',
                                  'dbname=DBNAME_HERE',
                                  'user=USER_HERE',
                                  'password=PASSWORD_HERE']))

# create a table to store your data
with conn.cursor() as c:
    c.execute(statement)
    conn.commit()

# copy the string buffer to the table
with conn.cursor() as c:
    c.copy_from(sio, "titanic", columns=titanic.columns, sep=',')
    conn.commit()

# close the connection with the database
conn.close()

**Execute a query and store results as a pandas dataframe**

In [None]:
# create a connection with the database

conn = psycopg2.connect(' '.join(['host=HOST_HERE',
                                  'dbname=DBNAME_HERE',
                                  'user=USER_HERE',
                                  'password=PASSWORD_HERE']))

# fetch and store all data from the table
with conn.cursor() as c:
    c.execute('SELECT * FROM '+table)
    data = c.fetchall()
    
# close the connection with the database
conn.close()

# convert stored data into a Pandas dataframe
df = pd.DataFrame(data, columns=titanic.columns)
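The same fetch-then-convert pattern can be tried without a hosted database. The sketch below swaps in Python's built-in `sqlite3` module and an in-memory database; the `passengers` table and its rows are hypothetical. It also reads the column names from `cursor.description` instead of relying on an existing dataframe.

```python
import sqlite3
import pandas as pd

# In-memory database, used here as a stand-in for the hosted PostgreSQL instance
conn = sqlite3.connect(':memory:')
try:
    cur = conn.cursor()
    cur.execute('CREATE TABLE passengers (name TEXT, age REAL)')
    cur.executemany('INSERT INTO passengers VALUES (?, ?)',
                    [('Ann', 31.5), ('Bob', 40.0)])
    conn.commit()

    # Fetch every row, then read the column names from the cursor metadata
    cur.execute('SELECT * FROM passengers')
    rows = cur.fetchall()
    column_names = [description[0] for description in cur.description]
finally:
    # Close the connection even if one of the statements above fails
    conn.close()

result = pd.DataFrame(rows, columns=column_names)
print(result)
```

The `try`/`finally` block is a design choice worth carrying over: it guarantees the connection is released when a query raises an error.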

**The above code packaged into a Class** 

In [None]:
class connection():
    def __init__(self, host, dbname, user, password):
        self.host = host
        self.dbname = dbname
        self.user = user
        self.password = password
        self.conn = psycopg2.connect(' '.join(['host='+str(host),
                                  'dbname='+str(dbname),
                                  'user='+str(user),
                                  'password='+str(password)]))
        
    def create_table(self, table, data):
        self.data = data
        self.table = table
        self.key = {'object':'varchar',
           'int64':'int',
           'float64':'float',
           'datetime64[ns]':'date',
           'bool':'boolean'
           }
        
        def get_length(df, column):
            return str(max([len(str(x)) for x in df[column]]))

        def get_type(df, column):
            return self.key[str(df[column].dtypes)]
        
        self.drop = "DROP TABLE IF EXISTS "+self.table+" CASCADE;"
        self.create = "CREATE TABLE "+self.table+"("+ \
                ', '.join([column +' '+ get_type(self.data, column) +'({})'.format(get_length(self.data, column)) \
                if (get_type(self.data, column)!='date')&(get_type(self.data, column)!='int') \
                else column +' '+ get_type(self.data, column) \
                for column in self.data.columns]) +');'
        self.statement = ' '.join([self.drop, self.create])
        with self.conn.cursor() as c:
            c.execute(self.statement)
            self.conn.commit()
    
    def fill_table(self, table, data):
        self.table = table
        self.data = data
        self.sio = StringIO()
        self.sio.write(self.data.to_csv(index=None, header=None))
        self.sio.seek(0)
        with self.conn.cursor() as c:
            c.copy_from(self.sio, self.table, columns=self.data.columns, sep=',')
            self.conn.commit()
    
    def query_table(self, query, columns):
        self.columns = columns
        self.query = query
        with self.conn.cursor() as c:
            c.execute(self.query)
            self.data = c.fetchall()
        self.df = pd.DataFrame(self.data, columns=self.columns)
    
    def close_connection(self):
        self.conn.close()
        

In [None]:
# connect to database
postgres = connection(host='HOST_HERE', 
                      dbname='DBNAME_HERE', 
                      user='USER_HERE', 
                      password='PASSWORD_HERE')

# create the 'titanic' table
postgres.create_table('titanic', titanic)

# fill the 'titanic' table with the titanic dataframe
postgres.fill_table('titanic', titanic)

# query the 'titanic' table and store the data as a dataframe
postgres.query_table('SELECT * FROM titanic', titanic.columns)
df = postgres.df

# see the top records from the new titanic dataframe
df.head()

# close the connection
postgres.close_connection()


## Webscraping 
**Use Case:** Python.org Events (HTML)

In [None]:
# import libraries
import requests
from bs4 import BeautifulSoup

# request the html code from a url
url = r'https://www.python.org/events/python-events/'
r = requests.get(url)

In [None]:
# convert HTML into a structured Soup object
b = BeautifulSoup(r.text, 'lxml')

In [None]:
# get the text for one element in the table
b.find_all('ul', attrs={'class':"list-recent-events menu"})[0]

In [None]:
# store all of the elements by column in a dictionary
dct = {'event':[x.text for x in b.find_all('ul', attrs={'class':"list-recent-events menu"})[0].find_all('h3')],
       'date':[x.text for x in b.find_all('ul', attrs={'class':"list-recent-events menu"})[0].find_all('time')],
       'location':[x.text for x in b.find_all('ul', attrs={'class':"list-recent-events menu"})[0].find_all('span', attrs={'class':'event-location'})]
      }

# convert the dictionary into a data table
events = pd.DataFrame(dct)
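Building each column with a separate `find_all` call assumes every event has all three fields; if one event lacks a location, the lists end up with different lengths. A possible alternative, sketched below on a hypothetical inline HTML snippet (not the live python.org page), is to loop over the list items and pull the fields from each one.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet that mimics the structure of the events list
html = """
<ul class="list-recent-events menu">
  <li><h3>PyCon Example</h3><time>01 Jan</time><span class="event-location">Springfield</span></li>
  <li><h3>PyData Example</h3><time>02 Feb</time><span class="event-location">Shelbyville</span></li>
</ul>
"""

# html.parser ships with Python, so no extra parser needs to be installed
soup = BeautifulSoup(html, 'html.parser')
event_list = soup.find('ul', class_='list-recent-events')

# One dictionary per <li>, so the fields of each event stay together
rows = []
for item in event_list.find_all('li'):
    rows.append({
        'event': item.find('h3').get_text(strip=True),
        'date': item.find('time').get_text(strip=True),
        'location': item.find('span', class_='event-location').get_text(strip=True),
    })
print(rows)
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(rows)`.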

**Exercise**: Repeat the above process to create a dataframe containing information about recent polls.

**Extra**: Create a function called "GetPolls" that returns only the polls for a specific date.

In [None]:
url = r'https://www.realclearpolitics.com/epolls/latest_polls/'



## API
**Use Case:** Google Maps (REST)

In [None]:
import urllib.parse
import requests
import json

# part 1 - base path of the api
main = r'https://maps.googleapis.com/maps/api/geocode/json?'

# part 2 - the address search component
search = '555 Hamilton Avenue, Palo Alto, CA'

# part 3 - your api developer key to authenticate the search
api_key = r'&key=KEY_HERE'

# put parts 1, 2, and 3 together
url = main + urllib.parse.urlencode({'address':search})+api_key

# make a get request to the api
result = requests.get(url).json()
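To see what the assembled request looks like without calling the API, the sketch below builds the URL only. Unlike the cell above, it places the key inside the same dictionary as the address so that `urlencode` escapes both; `KEY_HERE` remains a placeholder.

```python
from urllib.parse import urlencode

# Base path of the geocoding endpoint, as used above
base = 'https://maps.googleapis.com/maps/api/geocode/json?'

# Both query parameters go through urlencode, so spaces and commas are escaped
params = {'address': '555 Hamilton Avenue, Palo Alto, CA', 'key': 'KEY_HERE'}
url = base + urlencode(params)
print(url)
```

`requests.get` also accepts a `params` dictionary and performs the same encoding, which avoids assembling the string by hand.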

In [None]:
# using the Python client 
import googlemaps

gmaps = googlemaps.Client(key=r'KEY_HERE')
search = ('555 Hamilton Avenue, Palo Alto, CA')
result = gmaps.geocode(search)

In [None]:
# create your own googlemaps package
import urllib.parse
import requests

class dumbgooglemaps():
    def __init__(self):
        # base path of the geocoding endpoint
        self.main = r'https://maps.googleapis.com/maps/api/geocode/json?'
    def Client(self, key):
        # store the key as a ready-to-append query parameter
        self.key = r'&key='+key
    def geocode(self, search):
        self.search = search
        self.url = self.main + urllib.parse.urlencode({'address':self.search}) + self.key
        return requests.get(self.url).json()

In [None]:
# run your own googlemaps package
dgmaps = dumbgooglemaps()
dgmaps.Client(key='KEY_HERE')
result = dgmaps.geocode(search=r'555 Hamilton Avenue, Palo Alto, CA')

**Exercise**: Explore the Star Wars API to find the name of the "homeworld" of "Han Solo"

**Extra**: Create a class called 'StarWarsAPI' with a function called 'CharacterSearch', which returns the JSON data for any character you search for

In [None]:
url = r'http://swapi.co/api/'

# 4. ETL
## Data Manipulation

In [None]:
# add a column for each passenger's title
titanic['title'] = [x.split('.')[0] for x in titanic['name']]

In [None]:
# add a column to group each passenger by age range
age_group = []

for age in titanic['age']:
    if age < 1:
        age_group.append('00 years')
    elif age <= 4:
        age_group.append('01-04 years')
    elif age <= 9:
        age_group.append('05-09 years')
    elif age <= 14:
        age_group.append('10-14 years')
    elif age <= 19:
        age_group.append('15-19 years')
    elif age <= 24:
        age_group.append('20-24 years')
    elif age <= 29:
        age_group.append('25-29 years')
    elif age <= 34:
        age_group.append('30-34 years')
    elif age <= 39:
        age_group.append('35-39 years')
    elif age <= 44:
        age_group.append('40-44 years')
    elif age <= 49:
        age_group.append('45-49 years')
    elif age <= 54:
        age_group.append('50-54 years')
    else:
        age_group.append('55+ years')
        
titanic['age_group'] = age_group
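The same grouping can be expressed more compactly with `pd.cut`. This is a sketch on a hypothetical series of ages, and it is not an exact drop-in replacement: fractional ages such as 4.5 land in '01-04 years' here but in '05-09 years' in the loop above, and missing ages stay missing instead of falling into '55+ years'.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including a missing value
ages = pd.Series([0.5, 3, 9, 22, 54, 80, np.nan])

# Left-closed bins: [-inf, 1), [1, 5), [5, 10), ..., [55, inf)
edges = [-np.inf, 1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, np.inf]
labels = ['00 years', '01-04 years', '05-09 years', '10-14 years',
          '15-19 years', '20-24 years', '25-29 years', '30-34 years',
          '35-39 years', '40-44 years', '45-49 years', '50-54 years',
          '55+ years']

age_group = pd.cut(ages, bins=edges, labels=labels, right=False)
print(age_group)
```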

## Airflow

To install Airflow on your local machine, enter in your terminal:

`pip install apache-airflow`

Once Airflow is installed, initialize the database that stores all of the information Airflow needs to run. The command below is the Airflow 1.x form; Airflow 2.x renamed it to `airflow db init`:

`airflow initdb`

Replace your current 'dags' directory, `Users/[YOUR_USERNAME]/airflow/dags/`, with the directory inside your training repository.

Create a 'stage' directory in airflow, beside the 'dags' folder, to store stage data: `Users/[YOUR_USERNAME]/airflow/stage`

Next, run the Airflow scheduler in your terminal:

`airflow scheduler`

Then, in a new terminal window, run the Airflow webserver:

`airflow webserver -p 8080`

**Exercise**

*Step 1* - In a browser, navigate to `http://localhost:8080/admin/` to interact with your Airflow UI.

*Step 2* - Review the dag 'titanic_example.py' as well as the package 'titanic_main.py' (in the 'library' folder) to better understand how to set up a dag.



*Step 3* - When you are finished with Airflow, close the scheduler and kill the webserver with the following command:

`cat airflow/airflow-webserver.pid | xargs kill -9`

# 5. Statistical Modeling
## Introduction

## Regression

**What is it?**

Linear regression is a linear approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables).

**How do you do it?**

<img src="images/regression_1.png" style="width: 800px;"/>

**How do you evaluate it?**

<img src="images/regression_2.png" style="width: 800px;"/>
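As a rough numeric illustration (not a reproduction of the figures above), the sketch below fits a one-variable linear regression by ordinary least squares with NumPy and scores it with the coefficient of determination, R². The five data points are invented for the example.

```python
import numpy as np

# Hypothetical data that follows roughly y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Design matrix with a column of ones for the intercept
A = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: minimize the squared residuals ||A @ coef - y||^2
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = coef

# R^2 = 1 - (residual sum of squares / total sum of squares)
predictions = A @ coef
ss_res = np.sum((y - predictions) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(intercept, slope, r_squared)
```

An R² close to 1 means the fitted line explains most of the variation in `y`; on real data it should be computed on observations that were not used for fitting.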

## Classification

**What is it?**

Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.

**How do you do it?**

**How do you evaluate it?**
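One common starting point, which the worked example later in this section also uses, is the confusion matrix and the metrics derived from it. The sketch below computes them by hand on hypothetical labels so that each quantity is explicit.

```python
import numpy as np

# Hypothetical true labels and predictions for eight observations
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# The four cells of the confusion matrix
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

accuracy = (tp + tn) / len(y_true)   # share of all predictions that are correct
precision = tp / (tp + fp)           # share of predicted positives that are correct
recall = tp / (tp + fn)              # share of actual positives that are found

print(tp, tn, fp, fn, accuracy, precision, recall)
```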

## Clustering

**What is it?**

Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).

**How do you do it?**

**How do you evaluate it?**

## Dimensionality Reduction

**What is it?**

Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.

**How do you do it?**

**How do you evaluate it?**

## Example (Classification)

In [None]:
from sklearn import tree
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

In [None]:
y = titanic['survived']

In [None]:
pclass = pd.get_dummies(titanic['pclass'], prefix='pclass').iloc[:,:-1]
sex = pd.get_dummies(titanic['sex'], prefix='sex').iloc[:,:-1]
title = pd.get_dummies(titanic['title'], prefix='title').iloc[:,:-1]
age_group = pd.get_dummies(titanic['age_group'], prefix='age_group').iloc[:,:-1]
X = pd.concat([pclass, sex, title, age_group], axis=1)

In [None]:
dtree = tree.DecisionTreeClassifier(random_state=1)
depth_range = list(range(1,20))
param_grid = dict(max_depth=depth_range)
grid = GridSearchCV(dtree, param_grid, cv=5, scoring='roc_auc')
grid.fit(X, y)

In [None]:
grid_mean_scores = grid.cv_results_['mean_test_score']

In [None]:
plt.figure()
plt.plot(depth_range, grid_mean_scores)
plt.grid(True)
plt.plot(grid.best_params_['max_depth'], grid.best_score_, 'ro', markersize=12, markeredgewidth=1.5, markerfacecolor='None', markeredgecolor='r')

In [None]:
best = grid.best_estimator_

In [None]:
pd.DataFrame(best.feature_importances_, index=X.columns.tolist())[0].sort_values(ascending=False)

In [None]:
preds = best.predict(X)

In [None]:
metrics.accuracy_score(y, preds)

In [None]:
metrics.confusion_matrix(y, preds)

In [None]:
probs = best.predict_proba(X)[:,1]

In [None]:
metrics.roc_auc_score(y, probs)

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(y, probs)
plt.plot(fpr, tpr)
plt.xlim([0,1])
plt.ylim([0,1])
plt.xlabel('FPR (1 - Specificity)')
plt.ylabel('TPR (Sensitivity)')

**Exercise**

In [None]:
# compare the above model to a Random Forest Classifier tuning the 'n_estimators' and 'max_depth' parameters
from sklearn import ensemble

rforest = ensemble.RandomForestClassifier(random_state=1)


# 6. Model Deployment
## Pickle

In [None]:
import pickle
filename = r'./models/dtree_model.pkl'
with open(filename, 'wb') as f:
    pickle.dump(best, f)

In [None]:
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
metrics.accuracy_score(y, loaded_model.predict(X))
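The save-and-load round trip can be checked in isolation. The sketch below pickles a plain dictionary (a hypothetical stand-in for a fitted model) into a temporary directory and confirms that the restored object equals the original.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model
artifact = {'max_depth': 3, 'features': ['pclass_1', 'sex_female']}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.pkl')

    # Write in binary mode; the with-block closes the file handle
    with open(path, 'wb') as f:
        pickle.dump(artifact, f)

    # Read it back in binary mode
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == artifact)
```

Unpickling can execute arbitrary code, so only load pickle files from sources you trust.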

## AWS Endpoint

https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/random_cut_forest/random_cut_forest.ipynb

# 7. Web Application
## Flask