# Motivation

[Mark Cuban](https://www.cnbc.com/2019/03/18/billionaire-shark-tank-judge-mark-cuban-if-i-were-to-start-a-business-today-heres-what-it-would-be.html)

 “As big as PCs were an impact, as big as the internet was, AI is just going to dwarf it. And if you don’t understand it, you’re going to fall behind. Particularly if you run a business.”

“I mean, I get it on Amazon and Microsoft and Google, and I run their tutorials. If you go in my bathroom, there’s a book, ‘Machine Learning for Idiots.’ Whenever I get a break, I’m reading it”

“If you don’t know how to use it and you don’t understand it and you can’t at least have a basic understanding of the different approaches and how the algorithms work, you can be blindsided in ways you couldn’t even possibly imagine.”

## Some applications of ML

1. Hedge funds use satellite images of parking lots to predict the growth of companies
2. Using geolocation data to predict footfall in shopping malls
3. Banking: analyzing customer spending patterns and preemptively proposing a loan
4. Proposing travel insurance as soon as you buy a plane ticket
5. Analyzing credit risk
6. [vPhrase](https://www.vphrase.com/): generating natural language from portfolio data for investors to summarize the portfolio; personalized reports for branch managers


# Prerequisites

1. Install Anaconda from https://anaconda.org/
2. Create a new environment with the required packages: ```conda create -n summer_school ipython jupyter matplotlib pandas scikit-learn tensorflow nltk```
3. To activate it: ```conda activate summer_school```
4. To deactivate: ```conda deactivate```

In [None]:
# run this only the first time; in the downloader that opens, select the "book" collection
# import nltk
# nltk.download()

# Jupyter notebook basics

- This is a markdown cell; the next ones are code cells
- The notebook runs a kernel (here, Python), which executes the code cells
- Open the command palette to run any command or look up shortcuts: p
- Move around: arrow keys or j, k
- Run a cell and go to next cell: Shift+Enter
- Go to edit mode to edit a cell: click or Enter
- Get help on a function while editing: Shift+Tab, press twice to get more help
- Back to command mode from edit mode: Esc
- New cell above: a
- New cell below: b
- Delete cell: x
- Run a cell and insert new cell below: Alt+Enter

# Training a simple neural network

In [None]:
# this is a code cell that contains Python code
# we usually start with the imports
# these are the imports we usually use for machine learning
import numpy as np
import scipy
import scipy.sparse as sps
import matplotlib.pyplot as plt
import sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd
import tensorflow as tf
from nltk.corpus import movie_reviews

In [None]:
num_of_features = 5000

## Loading the dataset

Dataset URL: https://www.kaggle.com/neiljs/all-shark-tank-us-pitches-deals

In [None]:
df = pd.read_csv('Sharktankpitchesdeals.csv')
df.head()

In [None]:
for pitch in df.loc[:3, 'Pitched_Business_Desc']:
    print(pitch)
    print('-----------------------')

In [None]:
corpus = [pitch for pitch in df.loc[:, 'Pitched_Business_Desc']]
corpus[:3]

In [None]:
targets = [deal for deal in df.loc[:, 'Deal_Status']]
targets[:5]

In [None]:
set(targets)

### Bag of words representation

In [None]:
count_vectorizer = CountVectorizer(stop_words='english', max_df=0.95, min_df=2, max_features=20)
bows = count_vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(bows.toarray(), columns=count_vectorizer.get_feature_names_out()).head()

In [None]:
count_vectorizer = CountVectorizer(stop_words='english', max_df=0.95, min_df=2, max_features=num_of_features)
bows = count_vectorizer.fit_transform(corpus)
print("We have {} pitches.".format(bows.shape[0]))
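To make the bag-of-words idea concrete, here is a minimal pure-Python sketch of the counting step on a made-up three-sentence corpus. It is a simplification, not how `CountVectorizer` is implemented: it only lowercases and splits on whitespace, and it skips stop-word removal and the `max_df`/`min_df` filters. The toy sentences and the helper name `bag_of_words` are assumptions for illustration.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, count_rows) for a list of strings."""
    tokenized = [doc.lower().split() for doc in documents]
    # sort the vocabulary so that the column order is deterministic
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # one column per vocabulary word, holding its count in this document
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


# hypothetical toy corpus, not taken from the dataset
toy_corpus = [
    "a healthy snack for kids",
    "a mobile app for healthy recipes",
    "a snack bar",
]
vocabulary, rows = bag_of_words(toy_corpus)
print(vocabulary)
for row in rows:
    print(row)
```

Each document becomes one row of word counts, and word order inside the document is lost. That is why the representation is called a "bag" of words.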

### Producing training and test data

In [None]:
# the problem: we have sparse arrays, but neural networks need dense arrays!
# the proper solution is word embeddings; here we simply convert to dense arrays
bows = bows.toarray().astype(np.float32)
targets = np.array(targets, dtype=np.float32)

In [None]:
num_of_train = 600
X_train, y_train = bows[:num_of_train], targets[:num_of_train]
X_test, y_test = bows[num_of_train:], targets[num_of_train:]

In [None]:
X_train

In [None]:
X_train[0]

In [None]:
# the sigmoid activation function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

### The weights of a single neuron are in a vector

![title](nn_vector.png)

In [None]:
w = np.array([1, 2, 3])
x = np.array([1, 2, 3])
w @ x

### The weights of a layer of neurons are in a matrix
![title](nn_matrix.png)

In [None]:
w = np.array([[1, 2, 3], [1, 1, 1], [2, 2, 2]])
x = np.array([1, 2, 3])
w @ x

#### Bias

In [None]:
b = [3, 4, 5]
w @ x + b

#### Activation function

In [None]:
sigmoid(w @ sigmoid(w @ x + b) + b)
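The cell above reuses the same `w` and `b` for both layers, which keeps the example short. In a trained network each layer has its own weights and bias. The sketch below spells out such a two-layer forward pass; the layer sizes, the helper name `dense` and all numeric values are made up for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def dense(inputs, weights, bias, activation):
    """One fully connected layer: activation(weights @ inputs + bias)."""
    return activation(weights @ inputs + bias)


x = np.array([1.0, 2.0, 3.0])          # 3 input features

# hidden layer: 2 neurons, each with 3 weights (one row per neuron)
W1 = np.array([[0.1, 0.2, 0.3],
               [0.0, -0.1, 0.1]])
b1 = np.array([0.0, 0.5])

# output layer: 1 neuron with 2 weights
W2 = np.array([[1.0, -1.0]])
b2 = np.array([0.0])

hidden = dense(x, W1, b1, sigmoid)       # shape (2,)
output = dense(hidden, W2, b2, sigmoid)  # shape (1,)
print(hidden.shape, output.shape)
print(output)
```

The number of columns of each weight matrix must match the size of the vector coming in, and the number of rows sets the size of the vector going out.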

In [None]:
x = np.arange(-7, 7, 0.01)
fig, ax = plt.subplots(1, 1, figsize=(20, 10))
ax.plot(x, sigmoid(x))

In [None]:
# the relu activation function
x = np.arange(-7, 7, 0.01)
fig, ax = plt.subplots(1, 1, figsize=(20, 10))
ax.plot(x, np.maximum(x, 0))

### Optimization algorithm: some kind of gradient descent

![title](Gradient_descent.gif)
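As a minimal sketch of the idea in the animation, the loop below runs plain gradient descent on the one-dimensional function $f(w) = (w - 3)^2$, whose minimum is at $w = 3$. The function, the starting point and the learning rate are assumptions chosen for illustration; the `adam` optimizer used later in this notebook is a more elaborate variant of the same idea.

```python
def gradient(w):
    # derivative of f(w) = (w - 3) ** 2
    return 2 * (w - 3)


w = 0.0                # starting point
learning_rate = 0.1    # step size

for step in range(50):
    # move a small step against the gradient, i.e. downhill
    w = w - learning_rate * gradient(w)

print(w)
```

With this learning rate every step shrinks the distance to the minimum by a constant factor, so `w` ends up very close to 3. A learning rate that is too large would overshoot instead.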

### Loss function: binary crossentropy

If $y_i$ are the true labels, and $\hat{y}_i$ are the predictions of the network:

$- \frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right]$

In [None]:
x = np.arange(0.001, 1.2, 0.01)
fig, ax = plt.subplots(1, 1, figsize=(20, 10))
ax.plot(x, -np.log(x))
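The binary cross-entropy formula can be written in a few lines of NumPy. This is a sketch, not the Keras implementation; the clipping constant `eps` and the example labels and predictions are assumptions, and the clipping is there only so that `log(0)` never occurs.

```python
import numpy as np


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=np.float64)
    # keep predictions strictly inside (0, 1) to avoid log(0)
    y_pred = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))


y_true = [1, 0, 1, 0]
confident_and_right = [0.9, 0.1, 0.8, 0.2]   # hypothetical predictions
confident_and_wrong = [0.1, 0.9, 0.2, 0.8]   # hypothetical predictions

loss_right = binary_crossentropy(y_true, confident_and_right)
loss_wrong = binary_crossentropy(y_true, confident_and_wrong)
print(loss_right, loss_wrong)
```

Predictions that are confident and wrong are penalized far more than predictions that are confident and right, which matches the steep rise of $-\log(x)$ near zero in the plot.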

## Computational graph
![title](tensors_flowing.gif)

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(20, activation=tf.keras.activations.relu),
    tf.keras.layers.Dense(1, activation=tf.keras.activations.sigmoid)
])

In [None]:
# we compile our neural network model
# we also have to choose an optimizer and a loss function
# for a binary classification task usually binary cross-entropy is fine
# we use accuracy as the metric
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [None]:
# training or in other words, fitting the model to the data
model.fit(X_train, y_train, epochs=10, validation_split=0.1)

In [None]:
# the training accuracy looks very good, but
# the real check is evaluating on the held-out test set
model.evaluate(X_test, y_test)
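For a sigmoid output, accuracy is computed by thresholding the predicted probability and comparing the result with the label. The sketch below assumes a threshold of 0.5, which is the usual default, and uses made-up probabilities rather than actual model output.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0], dtype=np.float32)
# hypothetical sigmoid outputs, not produced by the model above
y_prob = np.array([0.8, 0.3, 0.4, 0.9, 0.6], dtype=np.float32)

# a probability above 0.5 counts as class 1, otherwise class 0
y_pred = (y_prob > 0.5).astype(np.float32)

# accuracy is the fraction of predictions that match the labels
accuracy = np.mean(y_pred == y_true)
print(accuracy)
```

Accuracy ignores how confident a prediction was, so the loss and the accuracy can move in different directions during training.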

## Let's try with another dataset!
Movie reviews - positive or negative

In [None]:
print(movie_reviews.raw('neg/cv000_29416.txt'))

In [None]:
corpus, targets = zip(*[(movie_reviews.raw(fileid), category)
                         for category in movie_reviews.categories() for fileid in movie_reviews.fileids(category)])

In [None]:
targets[:10]

In [None]:
count_vectorizer = CountVectorizer(stop_words='english', max_df=0.95, min_df=2, max_features=num_of_features)
bows = count_vectorizer.fit_transform(corpus)
print("We have {} documents.".format(bows.shape[0]))

In [None]:
set(targets)

In [None]:
# convert targets to numbers
targets = np.array([0 if target == 'neg' else 1 for target in targets])
targets[:30]

In [None]:
# we need to shuffle: the corpus was built category by category
perm = np.random.permutation(len(targets))
bows = bows[perm]
targets = targets[perm].astype(np.float32)

In [None]:
bows = bows.toarray().astype(np.float32)

In [None]:
num_of_train = 1800
X_train, y_train = bows[:num_of_train], targets[:num_of_train]
X_test, y_test = bows[num_of_train:], targets[num_of_train:]
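Shuffling matters for this dataset because the corpus was collected one category after the other, so a plain slice at the end would contain a single class. The sketch below uses a made-up toy array to show that effect, and how one shared permutation keeps features and labels aligned. The seed and the array contents are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed so the example is repeatable

features = np.arange(10).reshape(10, 1)   # toy "documents": row i holds the value i
labels = np.array([0] * 5 + [1] * 5)      # ordered by class, like the corpus above

# without shuffling, the last rows all belong to one class
print(labels[8:])

perm = rng.permutation(len(labels))
features, labels = features[perm], labels[perm]   # same permutation for both arrays

# every row still carries its original label (values 5 to 9 were class 1)
still_aligned = bool(np.all((features[:, 0] >= 5) == (labels == 1)))
print(still_aligned)
```

Indexing both arrays with the same `perm` is the important part: shuffling them independently would break the pairing between documents and labels.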

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(20, activation=tf.keras.activations.relu),
    tf.keras.layers.Dense(1, activation=tf.keras.activations.sigmoid)
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, validation_split=0.1)

In [None]:
model.evaluate(X_test, y_test)