### Analysis of the Saturday Night Live database

If you have downloaded the SNL database, you have the following files available:

* snl_season (sid, year)
* snl_episode (sid, eid, year, aired)
* snl_title (sid, eid, tid, title, titleType)
* snl_actor (aid, name, isCast)
* snl_actor_title (sid, eid, tid, aid, actorType)
* snl_rating (lots of rating data from IMDb)

In this notebook I take a first look at the data and show some of the analyses that are possible with this dataset. Feel free to take your own look at it.

## Imports & setup

In [None]:
import pandas as pd
import numpy as np
import bokeh
from bokeh.io import output_notebook
from bokeh.plotting import figure, show
import datetime
output_notebook()

## Load the data

In [None]:
dfs = pd.read_csv('../input/snl_season.csv', encoding="utf-8")
dfe = pd.read_csv('../input/snl_episode.csv', encoding="utf-8",parse_dates=['aired'])
dft = pd.read_csv('../input/snl_title.csv', encoding="utf-8")
dfa = pd.read_csv('../input/snl_actor.csv', encoding="utf-8")
dfat = pd.read_csv('../input/snl_actor_title.csv', encoding="utf-8")
dfr = pd.read_csv('../input/snl_rating.csv', encoding="utf-8")

## Have a look at the data

In [None]:
dfs.head(2)

In [None]:
dfe.head(2)

In [None]:
dft.head(2)

In [None]:
dfa.head(2)

In [None]:
dfat.head(10)

In [None]:
dfr.head(2)

#### Combine episodes and ratings
Since each rating belongs to one episode, we merge the two dataframes on season ID (`sid`) and episode ID (`eid`).

In [None]:
dfer = pd.merge(dfe, dfr, on=['sid', 'eid'])
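
`pd.merge` defaults to an inner join, so episodes without a rating row disappear from `dfer` without any warning. The following is a minimal sketch of how such gaps could be checked, using made-up toy frames in place of `dfe` and `dfr`; the values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for dfe and dfr; all values are made up for illustration.
episodes = pd.DataFrame({
    "sid": [1, 1, 2],
    "eid": [1, 2, 1],
    "aired": pd.to_datetime(["1975-10-11", "1975-10-18", "1976-09-18"]),
})
ratings = pd.DataFrame({
    "sid": [1, 2],
    "eid": [1, 1],
    "IMDb users_avg": [7.9, 7.1],
})

# how="left" keeps every episode; indicator=True adds a "_merge" column
# that records whether a matching rating row was found.
# validate="one_to_one" raises if (sid, eid) is duplicated on either side.
checked = pd.merge(episodes, ratings, on=["sid", "eid"],
                   how="left", indicator=True, validate="one_to_one")

# Episodes that have no rating row
unrated = checked[checked["_merge"] == "left_only"]
print(unrated[["sid", "eid"]])
```

If `unrated` is empty on the real data, the inner join used above loses nothing.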

## Ratings over time (per episode)
Now we can create our first graph. Let us look at the ratings over time. First sort the dataframe by season and episode.

In [None]:
dfer = dfer.sort_values(['sid', 'eid'], ascending=[True, True]).reset_index(drop=True)

In [None]:
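# Fit a degree-10 polynomial to the ratings to use as a trend line
trend = np.polyfit(dfer.index, dfer["IMDb users_avg"].values, 10)
trend_func = np.poly1d(trend)

# Plot the ratings (blue) and the trend (red) against the episode index.
# Note: Bokeh 3.x renamed plot_width/plot_height to width/height.
p = figure(plot_width=800, plot_height=200, y_range=(0,10))
r = p.multi_line([dfer.index, dfer.index],[dfer["IMDb users_avg"].values, trend_func(dfer.index)], color=['blue', 'red'])
t = show(p, notebook_handle=True)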
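
A high-degree polynomial can be sensitive near the ends of the series, so it may be worth comparing it with a simpler smoother. The sketch below uses synthetic ratings (not the SNL data) to show the same `np.polyfit`/`np.poly1d` pattern with a low degree, next to a centered rolling mean; the degree and window size are arbitrary choices.

```python
import numpy as np
import pandas as pd

# Synthetic ratings: a slow upward drift plus noise (made-up data).
rng = np.random.default_rng(0)
x = np.arange(200)
ratings = pd.Series(6.0 + 0.005 * x + rng.normal(0, 0.3, size=x.size))

# Low-degree polynomial trend, same polyfit/poly1d pattern as in the notebook.
coeffs = np.polyfit(x, ratings.values, 2)
poly_trend = np.poly1d(coeffs)(x)

# Centered rolling mean as an alternative smoother;
# min_periods=1 keeps values at the edges instead of producing NaN.
rolling_trend = ratings.rolling(window=21, center=True, min_periods=1).mean()
```

Both `poly_trend` and `rolling_trend` have one value per episode, so either can be passed to `multi_line` in place of `trend_func(dfer.index)`.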

## Ratings over time (per season)
It is also interesting to see how the average rating per season developed over the years.

In [None]:
sSeasonRatingAverage = dfer.groupby("sid")["IMDb users_avg"].mean()

In [None]:
# Take the x values from the groupby index so seasons and averages stay aligned
p = figure(plot_width=800, plot_height=200, y_range=(0,10))
r = p.line(sSeasonRatingAverage.index, sSeasonRatingAverage.values)
t = show(p, notebook_handle=True)
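
A season average hides how many episodes went into it. As a small sketch on made-up numbers, the same `groupby` can return the episode count alongside the mean:

```python
import pandas as pd

# Made-up ratings for two seasons.
toy = pd.DataFrame({
    "sid": [1, 1, 1, 2, 2],
    "IMDb users_avg": [7.0, 8.0, 9.0, 6.0, 7.0],
})

# One row per season with the average rating and the number of rated episodes.
season_stats = toy.groupby("sid")["IMDb users_avg"].agg(["mean", "count"])
print(season_stats)
```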

## Ratings over time (conclusion)

As the graphs show, there was a steep increase in ratings between seasons 28 and 33. Since then the ratings have been fairly constant. There were also some highs in the mid-90s and 80s.

## Moving on to the actors
Now let us take a look at the actors. First, it would be interesting to know which actors appeared in the most sketches and which of them were most present during their time on the show (most sketches per episode). To do that, we have to merge most of the dataframes.

In [None]:
dfactors = pd.merge(pd.merge(dfat, dfer, on=['sid', 'eid']), dfa, on='aid')

Now let's take a look at the Top 10 actors of SNL when it comes to appearances.

In [None]:
sActorsAppearances = dfactors.groupby('name')['sid'].count().sort_values(ascending=False)
sActorsAppearances.head(10)

The Top 3 are: Kenan Thompson, Phil Hartman and Darrell Hammond. Since Kenan is still on the show he can further increase his lead. But does he also have the most appearances per episode?

In [None]:
dfActorsEpisodes = pd.DataFrame(dfactors.groupby(['name','sid', 'eid'])['aid'].count().sort_values(ascending=False)).reset_index()
dfActorsEpisodes.head(10)

In this category four actors share first place: Ludacris, Richard Pryor, Ray Charles and Betty White. Each of them was part of 12 titles in a single episode. But which actor had the biggest presence on set over several episodes? Of course, it only makes sense to look at actors who appeared in more than one episode.

In [None]:
# Aggregate the per-episode counts (column 'aid') for every actor:
#   titles   = total number of titles the actor appeared in
#   episodes = number of episodes the actor appeared in
# Named aggregation is used because recent pandas versions reject
# the nested-dict renaming form.
dfActorsTitlePerEpisode = dfActorsEpisodes.groupby('name').agg(
    titles=('aid', 'sum'),
    episodes=('aid', 'count'),
)

In [None]:
dfActorsTitlePerEpisode["title_avg"] = dfActorsTitlePerEpisode["titles"] / dfActorsTitlePerEpisode["episodes"]
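
The three steps (count titles per actor and episode, aggregate per actor, divide) can be sketched end to end on a tiny made-up table. The actor names `A` and `B` and all values are hypothetical; the sketch uses pandas named aggregation.

```python
import pandas as pd

# Made-up appearance rows: one row per (actor, episode, title).
appearances = pd.DataFrame({
    "name": ["A", "A", "A", "A", "B", "B"],
    "sid":  [1, 1, 1, 2, 1, 1],
    "eid":  [1, 1, 2, 1, 1, 1],
    "tid":  [1, 2, 1, 1, 1, 3],
})

# Step 1: number of titles per actor and episode.
per_episode = (appearances.groupby(["name", "sid", "eid"])
               .size()
               .reset_index(name="titles"))

# Step 2: total titles and number of episodes per actor.
per_actor = per_episode.groupby("name").agg(
    titles=("titles", "sum"),
    episodes=("titles", "count"),
)

# Step 3: average number of titles per episode.
per_actor["title_avg"] = per_actor["titles"] / per_actor["episodes"]
print(per_actor)
```

Here actor `A` appears in 4 titles across 3 episodes, while actor `B` appears in 2 titles in a single episode.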

Let's take a look at the actors with appearances in at least 3 episodes.

In [None]:
dfActorsTitlePerEpisode[dfActorsTitlePerEpisode.episodes>=3].sort_values('title_avg', ascending=False).head(10)

Charles Barkley wins with 8.3 titles per episode. What about 10 episodes?

In [None]:
dfActorsTitlePerEpisode[dfActorsTitlePerEpisode.episodes>=10].sort_values('title_avg', ascending=False).head(10)

Now let's look at people with at least 50 episodes under their belt. These are mostly cast members.

In [None]:
dfActorsTitlePerEpisode[dfActorsTitlePerEpisode.episodes>=50].sort_values('title_avg', ascending=False).head(10)

Here we see Phil Hartman's impressive record of an average of 5.6 titles per episode across more than 160 episodes.

## End of the initial analysis
I hope I could spark your interest in this dataset. Maybe you have some ideas of interesting things to analyse about this TV show that is currently in its 42nd season. I will also add more data to this dataset if you point me towards a source of interesting data that would fit into it.

### Which questions can be answered with this dataset?
1) Which host/actor, host/singer or actor/singer pairs received high ratings?
2) How do SNL's ratings perform during election periods? Which host/actor/singer pairs were the best during those periods?
3) Which part of the show garners the best ratings?
4) For each month of the year, which episode garners the best audience ratings?

### Analysis of the month-wise performance of the SNL shows

In [None]:
# 'aired' was parsed as a datetime when loading, so the month name
# can be read directly through the .dt accessor
dfer['Month'] = dfer['aired'].dt.month_name()

In [None]:
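### By IMDb ratings
# Select several columns with a list (double brackets); recent pandas versions require this.
# Note: max() is taken per column, so the values in one row may come from different episodes.
dfer.groupby(['Month'])[['sid','eid','aired','IMDb users_avg']].max()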

In [None]:
### By US users_avg
# Same selection for the US user ratings; max() is again taken per column
dfer.groupby(['Month'])[['sid','eid','aired','US users_avg']].max()
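
Because `max()` works column by column, it does not identify a single best episode per month. One possible way to get that, sketched here on made-up data, is `idxmax()`, which returns the row label of the highest rating in each group. The helper column `month_num` is an assumption added only to sort the result in calendar order instead of alphabetically.

```python
import pandas as pd

# Made-up episodes and ratings for illustration.
toy = pd.DataFrame({
    "sid": [1, 1, 2, 2],
    "eid": [1, 2, 1, 2],
    "aired": pd.to_datetime(["1975-10-11", "1975-11-08",
                             "1976-10-02", "1976-11-13"]),
    "IMDb users_avg": [7.9, 6.5, 8.4, 7.0],
})
toy["Month"] = toy["aired"].dt.month_name()
toy["month_num"] = toy["aired"].dt.month  # helper column for calendar ordering

# Row label of the best-rated episode within each month.
best_idx = toy.groupby("Month")["IMDb users_avg"].idxmax()

# Look up the full rows so sid, eid and aired belong to the same episode.
best = toy.loc[best_idx].sort_values("month_num")
print(best[["Month", "sid", "eid", "aired", "IMDb users_avg"]])
```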

In [None]:
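# Boolean indexing keeps only the matching rows; DataFrame.where would keep
# every row and fill the non-matching ones with NaN
dfhosts = dfat[dfat.actorType == 'host']
dfcasts = dfat[dfat.actorType == 'cast']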
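
When selecting rows by `actorType`, it helps to keep in mind that `DataFrame.where` and boolean indexing behave differently. A minimal sketch on a made-up three-row frame:

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    "aid": [10, 11, 12],
    "actorType": ["host", "cast", "cast"],
})

# where() keeps the original shape and replaces non-matching rows with NaN.
masked = toy.where(toy["actorType"] == "host")

# Boolean indexing returns only the rows where the condition is True.
filtered = toy[toy["actorType"] == "host"]

print(len(masked), len(filtered))
```

With `where`, a following `.head(1000)` therefore takes the first 1000 rows of the original table, most of them NaN, rather than the first 1000 hosts.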

In [None]:
dfhostsratings = pd.merge(dfhosts,dfr,on=["eid","sid"])
dfhostsratings = pd.merge(dfhostsratings,dfa,on=['aid'])
dfcastsratings = pd.merge(dfcasts,dfr,on=['eid','sid'])
dfcastsratings = pd.merge(dfcastsratings,dfa,on=['aid'])

In [None]:
dfhostsratings = dfhostsratings.sort_values("US users_avg", ascending=False).head(1000)
dfcastsratings = dfcastsratings.sort_values("US users_avg", ascending=False).head(1000)

In [None]:
dfhostcasts = pd.merge(dfhostsratings,dfcastsratings,on=["eid","sid"])

In [None]:
dfhostcasts.head(2)

In [None]:
# Count the merged rows for every host/cast pair (name_x = host, name_y = cast member)
dfhostcasts.groupby(['name_x','name_y'])['eid'].count().reset_index()
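
To move from counting pairs towards question 1 (which pairs had high ratings), the same grouping can also return an average rating. This is only a sketch on made-up rows; the column names `host` and `cast` are hypothetical stand-ins for `name_x` and `name_y`. Since the appearance table has one row per title, duplicates are dropped first so that each pair counts once per episode.

```python
import pandas as pd

# Made-up host/cast rows; the first two rows are the same pair in the same episode.
pairs = pd.DataFrame({
    "sid":  [1, 1, 1, 2, 2],
    "eid":  [1, 1, 1, 3, 3],
    "host": ["H1", "H1", "H1", "H1", "H1"],
    "cast": ["C1", "C1", "C2", "C1", "C3"],
    "US users_avg": [8.0, 8.0, 8.0, 6.0, 6.0],
})

# Keep one row per pair and episode.
pair_episodes = pairs.drop_duplicates(["sid", "eid", "host", "cast"])

# Episodes together and mean rating for every pair, best pairs first.
pair_stats = (pair_episodes.groupby(["host", "cast"])
              .agg(episodes=("eid", "count"),
                   mean_rating=("US users_avg", "mean"))
              .reset_index()
              .sort_values(["mean_rating", "episodes"], ascending=False))
print(pair_stats)
```

Pairs with only one shared episode can dominate such a ranking, so a minimum on `episodes` may be a sensible filter.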