# Short Flask Tutorial

Start with the Flask quickstart: http://flask.pocoo.org/docs/0.12/quickstart/. Your task is to write an HTTP service that receives a string and returns its length, then write a Python client that communicates with the service.

In [1]:
from flask import Flask, request, render_template
app = Flask(__name__)

@app.route('/', methods = ["GET","POST"])
def hello_world():
    
    if request.method == 'GET':
      return render_template('form.html')
    else:
      my_string = request.form["baby_name"]
      return str(len(my_string))
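
A client for this service could look like the sketch below. It uses only the standard library and assumes the service runs locally on Flask's default port 5000 and reads the string from the `baby_name` form field, as in the cell above. The helper names `build_request` and `get_length` are made up for illustration.

```python
import urllib.parse
import urllib.request


def build_request(text, base_url="http://127.0.0.1:5000/"):
    """Build a form-encoded POST request carrying `text` in the `baby_name` field."""
    body = urllib.parse.urlencode({"baby_name": text}).encode("utf-8")
    return urllib.request.Request(base_url, data=body, method="POST")


def get_length(text, base_url="http://127.0.0.1:5000/"):
    """Send the string to the service and return the length it reports."""
    with urllib.request.urlopen(build_request(text, base_url), timeout=5) as response:
        return int(response.read().decode("utf-8"))
```

With the service running, `get_length("hello")` should return the integer the server computed for that string.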



In [2]:
# Each `!` command runs in its own subshell, so a variable set with `!export`
# is gone by the time `!flask run` starts, and Flask reports
# "Could not locate Flask application". %env sets it for the notebook process.
%env FLASK_APP=serving_model.py
!flask run


# Prepare the data

Download https://raw.githubusercontent.com/hadley/data-baby-names/master/baby-names.csv

Load it with pandas and build a table with the following columns:

1. Name
1. Mean Percent of boys over the years
1. Mean Percent of girls over the years
1. Total percent (Column2+Column3) / 2
1. IsGirl (= Column3 > Column2)

Sort by total percent and take the top 2000 names.

Sort by name column and take every fifth name to be the test data
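
The selection and split steps above can be sketched as follows. This is one possible reading of the instructions, assuming a table with `names` and `total_pct` columns; the function name `top_names_split` and its parameters are hypothetical.

```python
import pandas as pd


def top_names_split(pct_df, top_n=2000, test_every=5):
    """Keep the `top_n` most common names, then hold out every fifth name (by name order)."""
    top = pct_df.sort_values("total_pct", ascending=False).head(top_n)
    top = top.sort_values("names").reset_index(drop=True)
    is_test = top.index % test_every == 0  # positions 0, 5, 10, ... go to the test set
    return top[~is_test], top[is_test]


# Toy data, not the real baby-names table.
toy = pd.DataFrame({
    "names": ["d", "a", "c", "b", "f", "e", "g"],
    "total_pct": [0.4, 0.1, 0.3, 0.2, 0.6, 0.5, 0.05],
})
train_df, test_df = top_names_split(toy, top_n=6)
```

Splitting by position in the sorted name list makes the split reproducible, unlike a random split without a fixed seed.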

In [1]:
import pandas as pd
baby_df = pd.read_csv("./baby-names.csv")

# get the mean percent per name, separately for girls and boys
# (select "percent" before mean() so non-numeric columns are never averaged)
girls_df = baby_df[baby_df.sex == "girl"].groupby("name")[["percent"]].mean()
boys_df = baby_df[baby_df.sex == "boy"].groupby("name")[["percent"]].mean()

# prepare data 
pct_df = pd.concat([boys_df,girls_df], axis = 1)
pct_df.columns = ["boys_pct","girls_pct"]
pct_df = pct_df.fillna(0)
pct_df["total_pct"] = pct_df.boys_pct + pct_df.girls_pct
pct_df = pct_df.reset_index()
pct_df.columns = ["names","boys_pct","girls_pct","total_pct"]

# is_girl column: 1 if the name is more common among girls, else 0
pct_df['is_girl'] = (pct_df.girls_pct > pct_df.boys_pct).astype(int)

Using the `nltk` package, create 2-grams of the characters in each name. You should have 358 features if you lowercase the names.

Notes:

1. Lowercase or not? Think about why either choice might matter.
1. Don't overdo this section; use 1-grams if that is easier and come back here after you have finished.
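
For experimenting with the two notes above, character n-grams can also be produced without `nltk`. The sketch below is a plain-Python stand-in (the function name `char_ngrams` is hypothetical) with switches for the n-gram size and for lowercasing.

```python
def char_ngrams(name, n=2, lowercase=True):
    """Return the list of character n-grams of `name`, in order."""
    text = name.lower() if lowercase else name
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("amit"))                    # 2-grams, lowercased
print(char_ngrams("Aaron", lowercase=False))  # keeps "Aa" distinct from "aa"
print(char_ngrams("Ab", n=1))                 # 1-grams
```

Keeping the original case preserves a hint about where the name starts (only the first 2-gram contains a capital), at the cost of more and sparser features.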

In [2]:
# prepare features
import nltk
from collections import Counter
X = []
for name in pct_df.names:
    name_ngrams = ["".join(ngram) for ngram in nltk.ngrams(name.lower(), 2)]
    name_ngrams_counts = Counter(name_ngrams)
    X.append(name_ngrams_counts)
feature_df = (pd.DataFrame(X).fillna(0).astype(int))

In [3]:
final_df = pd.concat([pct_df, feature_df], axis = 1)

What is the percent of boys in the data?

In [5]:
pct_df.is_girl.value_counts()
# share of boys = boys / (boys + girls)
#print("pct_boys:", round(2998 / (2998 + 3784) * 100, 2), "%")

1    3784
0    2998
Name: is_girl, dtype: int64

What is the sparsity of the data? What percent of the cells in the feature matrix you created are non-zero?

In [6]:
# percent of non-zero cells: count cells (!= 0) rather than summing the counts,
# because a 2-gram that repeats within a name has a count above 1
print("non-zero cells:", round((feature_df.values != 0).mean() * 100, 2), "%")

# Train a model

Train a model using Logistic Regression or any other model you like. Evaluate the model using Accuracy, AUC and Mean Average Precision (`average_precision_score`) on the train and test sets. Think about regularization: you have a lot of features. If you are running out of time, do this quickly and move on to the next section, then come back to it later.
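
One way to set up this evaluation is sketched below. It runs on synthetic count data rather than the real feature matrix, and the helper name `evaluate` and the grid of `C` values are arbitrary choices for illustration. In scikit-learn's `LogisticRegression`, a smaller `C` means stronger regularization.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate(model, X, y):
    """Return accuracy, ROC AUC and average precision for a fitted binary classifier."""
    scores = model.predict_proba(X)[:, 1]  # probability of the positive class (girl = 1)
    return {
        "accuracy": accuracy_score(y, model.predict(X)),
        "auc": roc_auc_score(y, scores),
        "average_precision": average_precision_score(y, scores),
    }


# Synthetic stand-in for the sparse 2-gram count matrix (assumption: not the real data).
rng = np.random.RandomState(0)
X = rng.poisson(0.05, size=(600, 50))
weights = rng.normal(size=50)
y = (X @ weights + rng.normal(scale=0.5, size=600) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

results = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    results[C] = {
        "train": evaluate(model, X_train, y_train),
        "test": evaluate(model, X_test, y_test),
    }
```

A large gap between the train and test metrics for a given `C` is a sign of overfitting, which is what regularization is meant to reduce.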

In [7]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

names = pct_df.names.values
feature_df["names"] = [name.lower() for name in names]
feature_df = feature_df.set_index("names")

y = pct_df.is_girl.values
#X = final_df.set_index("names")

X_train, X_test, y_train, y_test = train_test_split(feature_df,y,test_size = .33,)
clf = DecisionTreeClassifier()

#print(clf)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
#print(clf.score(X_test, y_test))



In [9]:
import pickle
# the with-block closes the file even if pickling fails
with open('name_classifier.pickle', 'wb') as f:
    pickle.dump(clf, f)

In [160]:
final_df = final_df.copy().set_index("names")
final_df.head(15)

names,boys_pct,girls_pct,total_pct,is_girl,aa,ab,ac,ad,ae,af,...,zg,zh,zi,zl,zm,zo,zr,zu,zy,zz
Aaden,0.000442,0.0,0.000442,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
Aaliyah,0.0,0.001317,0.001317,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Aarav,0.000101,0.0,0.000101,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Aaron,0.002266,8.9e-05,0.002355,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Ab,4.4e-05,0.0,4.4e-05,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Abagail,0.0,0.000133,0.000133,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Abb,4.6e-05,0.0,4.6e-05,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Abbey,0.0,0.000239,0.000239,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Abbie,4.6e-05,0.000243,0.000289,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Abbigail,0.0,0.000242,0.000242,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [12]:
name_grams = []
for gram in nltk.ngrams("amit".lower(), 2):
    name_grams.append("".join(gram))
    
name_grams

['am', 'mi', 'it']

In [165]:
# predict the gender of any name
import nltk
from collections import Counter
name = "abbie"
# count the 2-grams so the features match the counts used for training
name_grams = Counter("".join(gram) for gram in nltk.ngrams(name.lower(), 2))

pred_features = pd.DataFrame([{feature: name_grams.get(feature, 0) for feature in feature_df.columns}])
clf.predict(pred_features)

array([1])
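
For reuse in a service, the feature-building step can be wrapped in a small function. The sketch below is a plain-Python version that does not need `nltk` or pandas; the name `featurize` is hypothetical, and `feature_names` stands for the ordered list of 2-gram columns the model was trained on.

```python
from collections import Counter


def featurize(name, feature_names, n=2):
    """Return n-gram counts for `name`, ordered like `feature_names`."""
    text = name.lower()
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    # n-grams that were never seen in training have no column and are ignored
    return [counts.get(feature, 0) for feature in feature_names]


print(featurize("Abbie", ["ab", "bb", "bi", "ie", "zz"]))
print(featurize("baba", ["ba", "ab"]))
```

A row built this way can be passed to the classifier as a one-row matrix, for example `clf.predict([featurize(name, feature_names)])`.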

In [180]:
feature_df["is_girl"] = final_df.is_girl.values
feature_df.to_csv("./baby_model.csv")

In [14]:
df = pd.read_csv("./baby_model.csv")

In [31]:
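# Keep only the 2-gram columns. Use `not in` (or `and`): with `or` the
# condition is always true and nothing is filtered out.
feature_names = [col for col in df.columns if col not in ('is_girl', 'names')]
import pickle
with open('feature_names.pickle', 'wb') as f:
    pickle.dump(feature_names, f)

In [33]:
# list.pop() takes an integer index, not a value; use remove() to delete by value.
# 'names' is already filtered out above, so just verify it.
assert "names" not in feature_names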

# Save your model

Using `pickle`, save your model to disk.

# Serve your model

Using `flask`, create an API that takes a name and decides whether it is a boy's or a girl's name. Also add an endpoint that receives a list of names and returns a list of genders.
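
A possible shape for such an API is sketched below. The route names `/predict` and `/predict_batch`, the JSON layout and the `toy_predict` rule are all assumptions made for illustration. In a real service, `predict` would load the pickled classifier and the feature names, turn each name into a feature row, and map the model's 0/1 output to a label.

```python
from flask import Flask, jsonify, request


def toy_predict(names):
    """Hypothetical stand-in model: label names ending in 'a' or 'e' as girls."""
    return ["girl" if name.lower().endswith(("a", "e")) else "boy" for name in names]


def create_app(predict=toy_predict):
    app = Flask(__name__)

    @app.route("/predict")
    def predict_one():
        # Single name passed as a query parameter: /predict?name=Anna
        name = request.args.get("name", "")
        if not name:
            return jsonify(error="missing 'name' query parameter"), 400
        return jsonify(name=name, gender=predict([name])[0])

    @app.route("/predict_batch", methods=["POST"])
    def predict_many():
        # List of names passed as a JSON body: {"names": ["Anna", "John"]}
        payload = request.get_json(silent=True)
        names = payload.get("names") if isinstance(payload, dict) else None
        if not isinstance(names, list):
            return jsonify(error="expected a JSON body like {'names': [...]}"), 400
        return jsonify(genders=predict(names))

    return app
```

Passing the predictor into `create_app` keeps the web layer separate from the model, so the same routes can be exercised with Flask's test client before the real classifier is plugged in.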

# Consume your model with Python

Using `requests`, send requests to your model.
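
A minimal client sketch follows. It assumes the service exposes a POST endpoint, here called `/predict_batch` (a hypothetical name), that accepts `{"names": [...]}` as JSON and answers with `{"genders": [...]}`. Adjust the path and payload to whatever your own API defines.

```python
import requests


def predict_genders(names, base_url="http://127.0.0.1:5000", session=None):
    """POST a list of names to the (assumed) batch endpoint and return the list of genders."""
    http = session or requests.Session()
    response = http.post(
        base_url.rstrip("/") + "/predict_batch",
        json={"names": list(names)},
        timeout=10,
    )
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.json()["genders"]
```

The optional `session` argument allows connection reuse across many calls and makes it easy to point the same function at a local server or a deployed one by changing `base_url`.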

# Put it on Heroku

Use your client to consume the public model. Follow https://devcenter.heroku.com/articles/getting-started-with-python until step 3 (deploy your app).

# Evaluate through web


Evaluate your model and your friends' models using names that are not in the top 2000.

# Discussion

If this were a commercial service, how could you improve it?

1. Data preprocessing
1. Output type
1. Interface

Give some examples.
