# Turing Machine and Deep Learning Lecture 1
_Author: Satchit Chatterji (satchit.chatterji@gmail.com)_

## Notebook 1: Recap of Python for Data Science

This notebook is a quick recap of some of the concepts from the course *Python for Data Science* that may be relevant to the current course.

#### Learning outcomes:
- Using numpy, matplotlib, pandas
- Reading in CSVs
- Summary statistics
- Cleaning and preprocessing datasets
- Data exploration
- Data visualization

#### Topics of PDS not covered now, since they'll appear later in TMLDL
- One-hot encoding
- K-means clustering
- Polynomial fitting
- Under- and overfitting

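First, we load the data and take the usual first look at it: display the table, print summary statistics, and check the column types and missing values. We use the TMDB movie metadata dataset from Kaggle: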

https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
df = pd.read_csv("tmdb_5000_movies.csv")

In [None]:
df

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
# Drop every row that has a missing value in any column
df = df.dropna()
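
With its default arguments, `dropna()` removes every row that contains at least one missing value, so it can help to check how many values are missing per column first. A minimal sketch on a small made-up frame (the column values are illustrative, not taken from the TMDB file):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) standing in for the movie table
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "budget": [100.0, np.nan, 300.0, 400.0],
    "homepage": [None, None, "http://example.com", None],
})

# Number of missing entries in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Default dropna() keeps only rows without any missing value
fully_clean = toy.dropna()

# Restricting the check to selected columns keeps more rows
budget_clean = toy.dropna(subset=["budget"])

print(len(toy), len(fully_clean), len(budget_clean))
```

Whether to drop on all columns or only on the ones used later is a judgment call that depends on the analysis.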

In [None]:
# Keep only movies with a positive budget, revenue and runtime
df = df.loc[df["budget"]>0]
df = df.loc[df["revenue"]>0]
df = df.loc[df["runtime"]>0]
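
The three filters above could also be written as one boolean mask. A small sketch on a made-up frame (the numbers are hypothetical):

```python
import pandas as pd

# Toy frame with one zero in each of the first three rows
toy = pd.DataFrame({
    "budget":  [0, 10, 20, 30],
    "revenue": [5, 0, 40, 60],
    "runtime": [90, 100, 0, 120],
})

# Combine the conditions element-wise with &;
# each comparison needs its own parentheses
mask = (toy["budget"] > 0) & (toy["revenue"] > 0) & (toy["runtime"] > 0)
filtered = toy.loc[mask]
print(filtered)
```

Only the last row passes all three conditions, so it is the only one kept.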

In [None]:
df.describe()

## Exploration + Visualization

First, let's look at some numeric variables, since they need the least preprocessing.

In [None]:
df = df.sort_values(by=["budget"])
df.plot.scatter(x="revenue", y="vote_count")

In [None]:
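# numeric_only=True skips the text columns (pandas >= 2.0 raises an error on them otherwise)
df.corr(numeric_only=True)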

In [None]:
import seaborn as sns
# Drop the id column (it is just a label) and correlate the numeric columns only
sns.heatmap(df.drop(["id"], axis=1).corr(numeric_only=True), annot=True)
plt.show()
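
Correlations are only defined for numeric columns. One way to make that explicit is to select the numeric columns with `select_dtypes` before calling `corr`. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Toy frame (hypothetical values): one text column, two numeric columns
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "budget": [1.0, 2.0, 3.0, 4.0],
    "revenue": [2.0, 4.0, 6.0, 8.0],
})

# Keep only numeric columns, then compute pairwise Pearson correlations
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()
print(corr)
```

Here revenue is exactly twice the budget, so their correlation is 1 up to floating-point error.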

In [None]:
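# Return on investment: revenue earned per unit of budget
df["roi"] = df["revenue"]/df["budget"]
# Plot only ROI values below 20 so that a few extreme values don't squash the histogram
plt.hist(df[df["roi"]<20]["roi"], bins=100)
plt.xlim(0,20)
plt.show()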

In [None]:
print(df["roi"].max())
# Sort so that the movie with the highest ROI comes first, then inspect it
df = df.sort_values(by=["roi"], ascending=False)
df.iloc[0]
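
If only the single best row is needed, sorting the whole frame is not required: `idxmax` returns the index label of the largest value. A small sketch on hypothetical numbers:

```python
import pandas as pd

# Toy frame (made-up values)
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "budget": [10.0, 20.0, 5.0],
    "revenue": [30.0, 20.0, 50.0],
})
toy["roi"] = toy["revenue"] / toy["budget"]

# Index label of the row with the highest ROI, then the row itself
best_label = toy["roi"].idxmax()
best_row = toy.loc[best_label]
print(best_row)
```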

### Non-numeric data

Let's look at genres now. We should first convert the strings in the `genres` column into lists of genre names (we don't need the ids for now).

In [None]:
df["genres"].iloc[0]

In [None]:
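import ast

def process_genre_string(gs):
    # Each entry is a string holding a list of {"id": ..., "name": ...} dicts.
    # ast.literal_eval only parses literals, so unlike eval it cannot run arbitrary code.
    gs = ast.literal_eval(gs)
    gs = [x['name'] for x in gs]
    return gs

processed_genres = []
for index, row in df.iterrows():
    processed_genres.append(process_genre_string(row["genres"]))

df["proc_genres"] = processed_genres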
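
The parsing step can be tried in isolation on a hand-written string. The sketch below assumes the column holds JSON-formatted text (the sample string, its ids, and the helper name `genre_names` are illustrative), in which case the standard-library `json` module is another way to parse it:

```python
import json

# Hypothetical sample in the list-of-dicts shape described above
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

def genre_names(genre_string):
    """Return the list of genre names stored in a JSON-formatted string."""
    return [entry["name"] for entry in json.loads(genre_string)]

names = genre_names(sample)
print(names)
```

An empty list string (`"[]"`) simply yields an empty list of names.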

In [None]:
df[["proc_genres", "title"]]

1. Unique labels?
2. Histogram of genres in dataset?
3. ...?

In [None]:
all_genres = []
for genres in df["proc_genres"]:
    all_genres += genres

print(set(all_genres))

In [None]:
genre_counts = {genre:all_genres.count(genre) for genre in set(all_genres)}
genre_counts
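
As an alternative to calling `list.count` once per genre, the counts can be gathered in a single pass with `collections.Counter` from the standard library. A small sketch on hypothetical genre lists:

```python
from collections import Counter

# Made-up stand-in for the lists stored in the proc_genres column
toy_genre_lists = [["Action", "Comedy"], ["Comedy"], ["Drama", "Comedy"]]

# Flatten the nested lists and count each genre in one pass
counts = Counter(genre for genres in toy_genre_lists for genre in genres)

# most_common() returns (genre, count) pairs sorted by descending count
print(counts.most_common())
```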

In [None]:
# Sort genres from most to least frequent
genre_counts = dict(sorted(genre_counts.items(), key=lambda item: item[1], reverse=True))
xs = list(range(len(genre_counts)))
plt.figure(figsize=(10,10))
plt.bar(xs, list(genre_counts.values()))
plt.xticks(xs, labels=list(genre_counts.keys()), rotation=90)
plt.ylabel("Number of movies")
plt.show()

In [None]:
# Collect the ROI of every movie under each of its genres
genre_roi = {k:list() for k in genre_counts.keys()}
genre_roi_mean = {k:0 for k in genre_counts.keys()}
genre_roi_std = {k:0 for k in genre_counts.keys()}

for index, row in df.iterrows():
    for genre in row["proc_genres"]:
        genre_roi[genre].append(row["roi"])

for genre in genre_roi.keys():
    genre_roi_mean[genre] = np.mean(genre_roi[genre])
    genre_roi_std[genre]  = np.std(genre_roi[genre])

# Mean ROI per genre, in the same genre order as genre_counts
xs = list(range(len(genre_counts)))
plt.bar(xs, list(genre_roi_mean.values()))
# plt.errorbar(xs, genre_roi_mean.values(), yerr=list(genre_roi_std.values()), linestyle="None")

plt.xticks(xs, labels=list(genre_counts.keys()), rotation=90)
plt.ylabel("Mean ROI")
plt.show()
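
The same per-genre aggregation can be sketched without explicit loops using `DataFrame.explode` and `groupby`. The frame below is made up for illustration. Note that pandas' `std` uses the sample standard deviation (`ddof=1`) by default, while `np.std` uses the population version (`ddof=0`), so the two will differ slightly.

```python
import pandas as pd

# Toy frame (hypothetical values) with one list of genres per movie
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "proc_genres": [["Action", "Comedy"], ["Comedy"], ["Action"]],
    "roi": [2.0, 4.0, 6.0],
})

# explode() repeats each row once per genre in its list
exploded = toy.explode("proc_genres")

# Aggregate ROI per genre
stats = exploded.groupby("proc_genres")["roi"].agg(["mean", "std", "count"])
print(stats)
```

A movie with several genres contributes its ROI to each of them, which matches what the loop-based version does.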