# Overview - Get Started Early!

The project for this semester requires you to use Python to collect TV show ratings from the movie/show review site IMDB. You will collect the ratings for each episode of each TV show on the list of shows linked below. Read the instructions carefully.

http://www.imdb.com/search/title?languages=en%7C1&title_type=tv_series&num_votes=10000,&sort=num_votes,desc&page=1&ref_=adv_prv&page=1

Before collecting the data, grab the links to the first 500 TV shows on this list (you're more than welcome to get more). Note that you will need to loop through the pages, as only 50 shows are displayed on each page.
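
The page loop described above could be sketched roughly as follows. This is a minimal sketch, not a required approach: the helper names `page_urls` and `extract_title_ids` are hypothetical, the `ref_` tracking parameter from the list URL is dropped on the assumption that it is not needed, and the sketch assumes the site still honors the `page` parameter and links each show as `/title/tt.../`. Check both against the live pages before relying on it. Downloading each page is left out on purpose.

```python
import re
from urllib.parse import urlencode

BASE_URL = "http://www.imdb.com/search/title"
SHOWS_PER_PAGE = 50  # per the instructions: only 50 shows are displayed per page


def page_urls(n_shows=500, per_page=SHOWS_PER_PAGE):
    """Return one search URL per results page needed to cover n_shows."""
    n_pages = -(-n_shows // per_page)  # ceiling division
    urls = []
    for page in range(1, n_pages + 1):
        query = urlencode(
            {
                "languages": "en|1",
                "title_type": "tv_series",
                "num_votes": "10000,",
                "sort": "num_votes,desc",
                "page": page,
            },
            safe=",",  # keep commas literal, as in the list URL
        )
        urls.append(f"{BASE_URL}?{query}")
    return urls


def extract_title_ids(html):
    """Return the unique title IDs found in a page's HTML, in page order.

    Assumption: show links look like /title/tt1234567/ in the page source.
    """
    seen = []
    for title_id in re.findall(r"/title/(tt\d+)/", html):
        if title_id not in seen:
            seen.append(title_id)
    return seen
```

You would fetch each URL from `page_urls()` with the HTTP library of your choice, pass the response text to `extract_title_ids`, and stop once you have at least 500 IDs.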

## Here is a minimum list of data that you will need to collect for each TV show

### Data per episode
0. An identifier for each episode
1. The rating of each episode of the show
2. The number of ratings of each episode
3. The air date of each episode
4. The short description of each episode
5. The URL link to each episode
6. The runtime

### Data per show
1. Overall rating for the show
2. Total number of ratings
3. Number of votes for each rating level (1-10).
4. The demographic breakdown of ratings - the average and the total votes by age categories and gender.
5. IMDB Staff, Top 10k voters, US Users, and Non-US users' ratings and counts
6. The genre (main page)
7. The top plot keywords (main page)
8. The short storyline (main page)
9. List of main cast - the actor and actor ID (main page)
10. Main production companies (company credits)

# Midterm Deliverable
0. Your list of tv shows.
1. Working prototype code to grab all the data for a list of your shows. You do not need to have grabbed all 500; you just need to show that your code can grab the data for as many shows as necessary (minimum: data for 50 shows).
    1. Document your code; use markdown cells to describe large sections of code.
    2. Use comments to describe what you are trying to do with the code.
2. The data you collected. Note that this could be several files: for example, you might have one CSV file for overall TV show data and another for the data by episode.
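
One possible way to lay out the "several files" option is to keep show-level and episode-level records in separate CSV files linked by a show identifier. The sketch below uses only the standard library. The file names, column names, and all record values are made up for illustration, and you are free to organize your data differently.

```python
import csv
from pathlib import Path


def write_rows(path, rows):
    """Write a non-empty list of dicts to CSV; the first row's keys become the header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)


def save_dataset(out_dir, shows, episodes):
    """Save show-level and episode-level records as two separate CSV files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shows_path = out_dir / "shows.csv"        # hypothetical file name
    episodes_path = out_dir / "episodes.csv"  # hypothetical file name
    write_rows(shows_path, shows)
    write_rows(episodes_path, episodes)
    return shows_path, episodes_path


# Placeholder records: every value here is invented for illustration only.
example_shows = [
    {"show_id": "tt0000001", "title": "Example Show", "overall_rating": 8.1, "total_votes": 12345},
]
example_episodes = [
    {"show_id": "tt0000001", "episode_id": "tt0000002", "season": 1, "episode": 1,
     "rating": 7.9, "votes": 321, "air_date": "2001-01-01"},
    {"show_id": "tt0000001", "episode_id": "tt0000003", "season": 1, "episode": 2,
     "rating": 8.3, "votes": 298, "air_date": "2001-01-08"},
]
```

Calling `save_dataset("data", example_shows, example_episodes)` would write both files under a `data` folder. Keeping `show_id` in every episode row lets you join the two files later.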

# Final Deliverable
0. Your code for the entire project. Documented similarly to the midterm.
1. Your final dataset.
2. Figures. You can put as many/few (min. 1) of these in your final report as you deem fit, but you need to make all of these.
    1. Code to generate a figure from any TV show in your dataset that includes:
        1. X axis: Season - Episode Number going from first season first episode to last season last episode.
        2. Y axis: Average rating
        3. Dots: each season is a different color.
        4. Title: Name of TV show and the run years.
        5. Dot label: Air date
    2. Generate a plot that shows the following:
        1. X axis: Year
        2. Y axis: Rating
        3. Time series (line plot) of: average rating of all episodes aired that year, 25th percentile rating of all episodes aired that year, 75th percentile rating of all episodes aired that year.
    3. Create at least 3 other plots that you think provide some interesting insights about TV shows.
3. Tables
    1. Save a copy of the average and standard deviation of episode ratings for each TV show.
    2. Create a table that summarizes the following characteristics of TV shows by cohort year (the first year each show aired).
        1. Average number of seasons, episodes, and total run time for each cohort.
        2. Average overall ratings for each cohort
        3. Average number of overall votes for each cohort
    3. Create a table that summarizes the following characteristics of TV shows by genre (use the first listed genre for each show).
        1. Average overall ratings by genre
        2. Average number of votes by genre
        3. Average length (seasons, episodes, and total run time) by genre.
    4. Create at least 2 other tables that provide some interesting insights.
4. Regressions
    1. For each season of each show, run a regression that summarizes the trajectory of the season. 
        1. First, compute a normalized "episode count" so you can compare TV shows with more episodes to those with fewer. For example, you can compute $\text{eppct} = \frac{\text{episode number}-1}{\text{total episodes}-1}$ within each season. In this example the first episode has $\text{eppct} = 0$ and the last episode has $\text{eppct} = 1$.
        2. Run a regression of episode ratings on the variable created above and that variable squared.
        $$
        rating_{ep,s} = \beta_0+\beta_1\text{eppct}_{ep,s}+\beta_2\text{eppct}_{ep,s}^2
        $$
        The coefficient for the linear component ($\beta_1$) captures the linear trend and the coefficient for the squared component ($\beta_2$) captures the concavity/convexity of the trend (review quadratic functions/parabolas if necessary). If $\beta_2$ is negative, the trajectory is concave (its slope decreases over the season); if $\beta_2$ is positive, the trajectory is convex (its slope increases over the season).
        3. Take the average of the linear and squared coefficients across the seasons of each show.
        4. Create a scatter plot of these coefficients as follows:
            1. x-axis: linear coefficient
            2. y-axis: quadratic coefficient
            3. Color each dot from red to green based on the average intercept coefficient for each TV show (greener for higher coefficients, redder for lower coefficients).
            4. Note that green dots mean the average quality of the TV show is higher (higher ratings). Are any of the quadrants of the plot filled with more green or red dots?
    2. Run one regression to explain the overall rating of the TV show based on first season's average rating, last season's average rating, median episode rating, standard deviation of episode ratings, and the average trajectory coefficients (linear and squared).
        1. Run this regression by demographic category. Do different demographics (male vs. female, old vs. young) rate the overall show differently depending on the show's average/median/s.d. episode ratings and show trajectory? 
    3. Run at least 2 more regressions that you think reveal something interesting about the TV shows.
5. Report - You will turn in a report that documents what you've learned from this project about TV shows. Please schedule a meeting with me once you have a first draft completed, at least 1 week prior to the due date. Note that you do not need to have all of your analysis completed by then; the meeting is to make sure that the writing of your report is on the right track. The report will include the following sections:
    1. Abstract/ Executive Summary
        1. This will introduce briefly (no more than 3 paragraphs) what you've found about TV shows and how you found this out.
        2. You should make some recommendations to TV show producers about what they should take into account when producing future shows based on your findings.
    2. Data
        1. Introduce how you collected the data (data source, be specific - which pages you used, how you went about collecting the data/figuring out which pages to scrape).
        2. Describe the dataset.
            1. Include some summary statistics (this can be a table or figure) about TV shows in your dataset.
            2. Maybe include things such as: 
                1. number of TV shows
                2. number of episodes
                3. average ratings per episode
                4. average ratings by TV show
    3. Analyses
        1. Write a different subsection for each major analysis that you did. If several are similar or show similar results, group them together.
        2. Motivate each analysis. Describe what question the analysis addresses.
        3. Say what you did for each analysis. Be specific. If you created a new variable, state how it is defined. If you ran a regression, state the regression equation. Describe why you did what you did.
        4. Report the results of the analyses (figure/table) inside the text. If the table is too large, reference/link to a file.
        5. Describe what conclusions you could (or couldn't) make based on the analysis.
    4. Conclusions
        1. Organize your findings in a logical manner.
        2. State what you found based on the analyses.
        3. State what TV show executives can take away from your analyses.
    5. References
        1. List any sources you cited in the report, such as data sources and articles that help motivate your analyses and/or report (for example, [news articles](https://www.ft.com/content/68309b3a-1f02-11e7-a454-ab04428977f9) describing how we are in the "Golden Age of TV shows").
        2. Follow APA formatting. Use the Microsoft Word citation manager. Remember that any citation you have on the reference page must be used as an in-text citation and vice versa.
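
As a starting point for the per-season trajectory regression described under Regressions above, the steps (compute $\text{eppct}$, fit ratings on $\text{eppct}$ and $\text{eppct}^2$, then average the coefficients across seasons) could be sketched as below. This is one possible approach, not the required one. The function names are hypothetical, the fit uses `numpy.polyfit` (an ordinary least-squares polynomial fit) rather than a full regression package, and it assumes episodes within a season are numbered consecutively starting at 1 and that each season has at least 3 rated episodes.

```python
import numpy as np


def episode_pct(n_episodes):
    """Return eppct = (episode number - 1) / (total episodes - 1) for episodes 1..n."""
    if n_episodes < 2:
        raise ValueError("eppct is undefined for a season with fewer than 2 episodes")
    return np.arange(n_episodes) / (n_episodes - 1)


def fit_season(ratings):
    """Fit rating = b0 + b1*eppct + b2*eppct^2 for one season; return (b0, b1, b2)."""
    ratings = np.asarray(ratings, dtype=float)
    if len(ratings) < 3:
        raise ValueError("need at least 3 episodes to fit a quadratic")
    x = episode_pct(len(ratings))
    # np.polyfit returns coefficients from the highest degree down.
    b2, b1, b0 = np.polyfit(x, ratings, deg=2)
    return b0, b1, b2


def average_coefficients(seasons):
    """Average (b0, b1, b2) across a show's seasons (a list of rating lists)."""
    coefs = np.array([fit_season(r) for r in seasons])
    return coefs.mean(axis=0)
```

For a show, pass one list of episode ratings per season (in airing order) to `average_coefficients`. The second and third returned values are the averaged linear and squared coefficients for the scatter plot, and the first is the averaged intercept used for the dot color. If you also need standard errors or p-values, a dedicated regression library would be a better fit than `polyfit`.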

# Send me emails / come to my office hours if you have any questions along the way
        