In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

## Let's get some data to look at

Movies! IMDb has a lot of information about movies and TV shows, and the datasets are easy to download.


The documentation is here:

https://www.imdb.com/interfaces/

Let's look at movie ratings!

In [None]:
ratings = pd.read_csv("https://datasets.imdbws.com/title.ratings.tsv.gz", sep='\t')
ratings.head()

#### OK, great, we have these ratings, but which movies are they for?

Looking at the documentation:


#### title.ratings.tsv.gz – Contains the IMDb rating and votes information for titles
* tconst (string) - alphanumeric unique identifier of the title
* averageRating – weighted average of all the individual user ratings
* numVotes - number of votes the title has received

We'll have to get the name from somewhere else, using the "tconst" variable to match rows between dataframes.

#### title.basics.tsv.gz - Contains the following information for titles:
* tconst (string) - alphanumeric unique identifier of the title
* titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
* primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
* originalTitle (string) - original title, in the original language
* isAdult (boolean) - 0: non-adult title; 1: adult title
* startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
* endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
* runtimeMinutes – primary runtime of the title, in minutes
* genres (string array) – includes up to three genres associated with the title

#### Let's load the titles

In [None]:
titles = pd.read_csv("https://datasets.imdbws.com/title.basics.tsv.gz", sep='\t')

In [None]:
titles.head()

Let's clean up the data by selecting one type (movies) and dropping a few columns we're not going to use.

#### 1) Use titleType to get only movies
* What are the titleType values?
* How do we get rid of the ones we don't want?

#### 2) Let's drop some things we're not using
* Once we've filtered on titleType to keep only movies, we can drop that column
* It seems primaryTitle and originalTitle are the same for most titles, so let's keep only one
* endYear seems to be used only for TV shows, so let's drop endYear and rename startYear to just Year.

In [None]:
titles.titleType.unique()

#### Let's select 'movie' and 'tvMovie' and get rid of all the rest.

In [None]:
movies = titles[titles.titleType.isin(['movie', 'tvMovie'])]

# Also select only non-adult movies to be safe for work
movies = movies[movies.isAdult == 0]

In [None]:
movies.head()

### Let's drop some of our unused columns

Use the [drop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) method built into pandas.

In [None]:
movies = movies.drop(columns=['isAdult', 'originalTitle', 'titleType', 'endYear', 'genres'])

### Let's match up the movies to their rating now. 

[Merge](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)
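
A minimal sketch of how a merge on a shared key behaves, using made-up toy frames (`toy_titles`, `toy_ratings`, and their values are hypothetical stand-ins, not IMDb data). By default `merge` performs an inner join, so only keys present in both frames survive; `how="left"` with `indicator=True` is one way to see which titles had no rating.

```python
import pandas as pd

# Toy stand-ins for title.basics and title.ratings (values are made up).
toy_titles = pd.DataFrame({
    "tconst": ["tt001", "tt002", "tt003"],
    "primaryTitle": ["Alpha", "Beta", "Gamma"],
})
toy_ratings = pd.DataFrame({
    "tconst": ["tt001", "tt003", "tt999"],
    "averageRating": [7.1, 5.4, 8.0],
})

# Default how="inner": only keys found in both frames are kept.
inner = toy_titles.merge(toy_ratings, on="tconst")

# how="left" keeps every title; indicator=True adds a "_merge" column
# recording whether a matching rating was found.
left = toy_titles.merge(toy_ratings, on="tconst", how="left", indicator=True)

print(inner)
print(left)
```

In this toy case the inner join drops the title without a rating and the rating without a title, which is the same thing that happens to unrated titles when the real frames are merged.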

In [None]:
movies = movies.merge(ratings, on='tconst')

In [None]:
movies.head()

In [None]:
movies.rename(columns={'startYear': 'Year', 'averageRating': 'Rating'}, inplace=True)

### Let's plot something to see how the data looks

In [None]:
movies.plot.scatter('Year', 'Rating');

### Oops, looks like something is wrong

```
ValueError: could not convert string to float: '\\N'
```

#### A ‘\N’ is used to denote that a particular field is missing or null for that title/name.
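
One possible way to handle this marker, sketched here on a made-up three-row sample rather than the real files, is to declare it as a missing value at load time with the `na_values` argument of `pd.read_csv`. The `sample` string and `toy` frame are hypothetical; note that the marker has to be written as `"\\N"` in Python source.

```python
import io
import pandas as pd

# A tiny tab-separated sample shaped like the IMDb files (rows are made up).
sample = (
    "tconst\tstartYear\truntimeMinutes\n"
    "tt001\t1999\t120\n"
    "tt002\t\\N\t95\n"
    "tt003\t2005\t\\N\n"
)

# na_values tells the parser to treat the literal text \N as missing,
# so the affected columns are read as numbers with NaN gaps.
toy = pd.read_csv(io.StringIO(sample), sep="\t", na_values="\\N")

print(toy.dtypes)
print(toy.isna().sum())
```

The cells that follow take the other route and convert the columns after loading.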

Let's take out all the rows whose numeric values are null.

First, let's convert the columns to numbers; anything that can't be converted gets changed to NaN (Not a Number).
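
As a minimal sketch of that conversion on made-up values (the Series `raw` is hypothetical toy data), `errors='coerce'` replaces anything unparseable with NaN instead of raising, and `dropna` then removes those entries:

```python
import pandas as pd

# Made-up values: two valid numbers and the IMDb missing-value marker.
raw = pd.Series(["1999", "\\N", "2005"])

# errors="coerce" turns anything that cannot be parsed into NaN
# instead of raising an exception.
years = pd.to_numeric(raw, errors="coerce")

# dropna() then removes the entries that failed to parse.
clean = years.dropna()

print(years)
print(clean)
```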

In [None]:
movies.runtimeMinutes = pd.to_numeric(movies.runtimeMinutes, errors='coerce')
movies.Rating = pd.to_numeric(movies.Rating, errors='coerce')
movies.Year = pd.to_numeric(movies.Year, errors='coerce')
movies.numVotes = pd.to_numeric(movies.numVotes, errors='coerce')
movies.dropna(axis=0, how='any', inplace=True)

movies.plot.scatter('Year', 'Rating');

## Now that we have some clean data, let's make some cool plots!

In [None]:
fig = plt.figure(figsize=(16,9))
plt.hist2d(movies.Year, movies.Rating, bins=(100,30), range=((1920,2020),None), cmap='Spectral_r')
plt.show()

In [None]:

from scipy.stats import norm

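fig = plt.figure(figsize=(16, 9))
# Normalized histogram of runtimes from 0 to 180 minutes, one bin per minute
plt.hist(movies.runtimeMinutes, bins=180, range=(0, 180), density=True)
# Fit a normal distribution to the runtimes under 180 minutes
mean, std = norm.fit(movies.runtimeMinutes[movies.runtimeMinutes < 180])
# Overlay the fitted density across the current x-axis range
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
y = norm.pdf(x, mean, std)
plt.plot(x, y)
plt.show()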


In [None]:
top_movies = pd.read_csv("https://raw.githubusercontent.com/thechaudharysab/imdb-data-pandas-visualization/master/data/imdb_1000.csv")
top_movies.head()

In [None]:
# Keep only movies rated at least as high as the lowest-rated entry in the top list
movies = movies[movies.Rating >= top_movies.star_rating.min()]
movies.head()

In [None]:
movies = movies.merge(top_movies, left_on='primaryTitle', right_on='title')

movies.head()

In [None]:
fig = plt.figure(figsize=(16,9))

for content_rating in movies.content_rating.unique():
    movie_content_rating = movies[movies.content_rating == content_rating]
    plt.hist(movie_content_rating.star_rating, range=(7,10),label=content_rating)
    
plt.legend()
plt.show()

In [None]:
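# jointplot is a figure-level function that creates its own figure,
# so its size is set with `height` rather than plt.figure(figsize=...)
sns.jointplot(x=movies.Year, y=movies.duration, kind='kde', color="blue", height=9)
plt.show()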

In [None]:
fig = plt.figure(figsize=(16,9))
sns.boxplot(y=movies.duration, x=movies.star_rating)
plt.show()

In [None]:
fig = plt.figure(figsize=(16,9))
sns.violinplot(y=movies.duration, x=movies.star_rating)
plt.show()