# Talk Recommender - Pycon 2018

Lets start by looking at the data. In the interest of time we have scraped the talk descriptions from 2017 and 2018 pycon and loaded them into a Postgres database. We have a utiltiy function that loads the data from postgres and returns a pandas dataframe.

## Exercise 1: 
Load the data to a pandas data frame by calling `talks_df_from_db` defined in `predict_api.model`.

In [91]:
from predict_api.model import talks_df_from_db
talks_df = talks_df_from_db()
talks_df.head()

Unnamed: 0,id,title,description,presenters,date_created,date_modified,location,talk_dt,year
0,1,5 ways to deploy your Python web app in 2017,You’ve built a fine Python web application and...,Andrew T. Baker,2018-04-19 00:59:20.151875,2018-04-19 00:59:20.151875,Portland Ballroom 252–253,2017-05-08 15:15:00,2017
1,2,A gentle introduction to deep learning with Te...,Deep learning's explosion of spectacular resul...,Michelle Fullwood,2018-04-19 00:59:20.158338,2018-04-19 00:59:20.158338,Oregon Ballroom 203–204,2017-05-08 16:15:00,2017
2,3,aiosmtpd - A better asyncio based SMTP server,smtpd.py has been in the standard library for ...,Barry Warsaw,2018-04-19 00:59:20.161866,2018-04-19 00:59:20.161866,Oregon Ballroom 203–204,2017-05-08 14:30:00,2017
3,4,Algorithmic Music Generation,Music is mainly an artistic act of inspired cr...,Padmaja V Bhagwat,2018-04-19 00:59:20.165526,2018-04-19 00:59:20.165526,Portland Ballroom 251 & 258,2017-05-08 17:10:00,2017
4,5,An Introduction to Reinforcement Learning,Reinforcement learning (RL) is a subfield of m...,Jessica Forde,2018-04-19 00:59:20.169075,2018-04-19 00:59:20.169075,Portland Ballroom 252–253,2017-05-08 13:40:00,2017


## Exercise 2: 
How many talks do we have for the years 2017 and 2018?
Select the talk description columns from the `talks_df` data frame and split them by years. 

In [79]:
talks_df[talks_df.year==2017].head()
talks_df[talks_df.year==2018].head()

Unnamed: 0,id,title,description,presenters,date_created,date_modified,location,talk_dt,year
95,96,A Bit about Bytes: Understanding Python Bytecode,At some point every Python programmer sees Pyt...,James Bennett,2018-04-19 00:59:20.652441,2018-04-19 00:59:20.652441,Grand Ballroom B,2018-03-28 17:10:00,2018
96,97,Adapting from Spark to Dask: what to expect,"Until very recently, Apache Spark has been a d...",Irina Truong,2018-04-19 00:59:20.657577,2018-04-19 00:59:20.657577,Grand Ballroom A,2018-03-29 14:35:00,2018
97,98,All in the timing: How side channel attacks work,"In this talk, you’ll learn about a category of...","Philip James, Asheesh Laroia",2018-04-19 00:59:20.662121,2018-04-19 00:59:20.662121,Grand Ballroom B,2018-03-29 17:10:00,2018
98,99,Analyzing Data: What pandas and SQL Taught Me ...,"“So tell me,” my manager said, “what is an ave...",Alex Petralia,2018-04-19 00:59:20.667578,2018-04-19 00:59:20.667578,Global Center Ballroom AB,2018-03-29 15:15:00,2018
99,100,A practical guide to Singular Value Decomposit...,Recommender systems have become increasingly p...,Daniel Pyrathon,2018-04-19 00:59:20.673779,2018-04-19 00:59:20.673779,Room 26A/B/C,2018-03-29 13:50:00,2018


## Exercise 3:
Next its time to extract features from talk descriptions. In this step we build the feature set by tokenization, counting and normalization of the bi-grams from the text descriptions of the talk.

In [93]:
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
vectorized_text = vectorizer.fit_transform(talks_df['description'])
vectorized_text.shape[0]

190

Split the vectorized_text into two parts - the 2017 talks will be used for training and the 2018 talks will we used for predicting.

In [94]:
count_labeled = len(talks_df[talks_df.year == 2017])
vectorized_text_labeled = vectorized_text[:count_labeled]
vectorized_text_predict = vectorized_text[count_labeled:]

## Exercise 4: 
To use supervised learning we start by asking the user to label the Pycon 2017 talks with his/her preference (in person or later) based _only_ on the description of the talk.

We represent the labels as a vector of length 95, where each element corresponds the label for the talk description in the `vectorized_text_labeled` dataframe.  If the value of an element is set to 1, it would indicate the corresponding talk at that index in `labled_text` was selected for in person viewing. A zero would indicate that it is meant for viewing later. By default, for a new user we set every talk to 0. 

```
labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0]
```

Then after the person has labeled say 20 talks from 2017 it might look like:

```
labels = [0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0]
```

In [95]:
labels = [1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1]
len(labels)

95

Now that we have our lables and the feature sets, we need to split our training data into training set and testing set. This allows us to do cross validation and avoid overfitting. Use the `train_test_split` method from `sklearn.model_selection` to split the `vectorized_text_labeled` into training and testing set with the test size as one third of the size of the labeled. 

[Here](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) is the documentation for the function.

In [96]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(vectorized_text_predict, labels, test_size=.3)

## Exercise 5:
Finally we get to the stage for training the model. We are going to use a linear support vector machine. And check its accuracy by using the `classification_report` function. Note that we have not done any parameter tuning yet, so your model might not give you the best results. Feel free to tweak the parameters or use a different model to get a better result.

In [97]:
import sklearn
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report
classifier = LinearSVC()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
report = sklearn.metrics.classification_report(y_test, y_pred)
print(report)

             precision    recall  f1-score   support

          0       0.62      1.00      0.77        18
          1       0.00      0.00      0.00        11

avg / total       0.39      0.62      0.48        29



  'precision', 'predicted', average, warn_for)


## Exercise 6:
Use the model to predict which talks the user should go to. Print out the talk descriptions.

In [99]:
predicted_talks_vector = classifier.predict(vectorized_text_predict)
df = talks_df_from_db(year=2018)
predicted_talks_indexes = np.array(predicted_talks_vector).nonzero()[0].tolist()
print(f'We found {len(predicted_talks_indexes)} talks that you will enjoy')
df.loc[predicted_talks_indexes][['id', 'description', 'presenters', 'title', 'location']]

We found 24 talks that you will enjoy


Unnamed: 0,id,description,presenters,title,location
2,98,"In this talk, you’ll learn about a category of...","Philip James, Asheesh Laroia",All in the timing: How side channel attacks work,Grand Ballroom B
3,99,"“So tell me,” my manager said, “what is an ave...",Alex Petralia,Analyzing Data: What pandas and SQL Taught Me ...,Global Center Ballroom AB
4,100,Recommender systems have become increasingly p...,Daniel Pyrathon,A practical guide to Singular Value Decomposit...,Room 26A/B/C
5,101,Do we even need humans? Humans and data scienc...,Kelsey Pedersen,Augmenting Human Decision Making with Data Sci...,Grand Ballroom A
7,103,"Nowadays, there are many ways of building data...",Christopher Fonnesbeck,Bayesian Non-parametric Models for Data Scienc...,Global Center Ballroom AB
8,104,Behavior-Driven Development (BDD) is gaining p...,Andrew Knight,Behavior-Driven Python,Grand Ballroom A
10,106,"You've used pytest and you've used mypy, but b...",Hillel Wayne,Beyond Unit Tests: Taking Your Testing to the ...,Room 26A/B/C
11,107,Big-O is a computer science technique for anal...,Ned Batchelder,Big-O: How Code Slows as Data Grows,Grand Ballroom C
12,108,"In the past few years, the power of computer v...",Kirk Kaiser,Birding with Python and Machine Learning,Grand Ballroom C
13,109,"Facebook, Google, Uber, LinkedIn, and friends ...",Sam Kitajima-Kimbrel,Bowerbirds of Technology: Architecture and Tea...,Global Center Ballroom AB
