# Visualization Course | Final Project
## **Movies Analysis**
#### *By Yana Kryshchuk*

### Data Description
Data is extracted from *data.world*. <br>

The dataset covers the ten most popular movies of each year for the years 1975-2015. It includes:

- Movie titles
- Poster URLs for each
- Genre information
- Run time
- MPAA ratings
- IMDB rating
- Rotten Tomatoes audience/critic rating
- Box office receipts (adjusted for inflation) <br>

Data was added: May 22, 2015

Let's import all the necessary libraries to work with the data.

In [1057]:
import pandas as pd
import altair as alt
from altair import datum


In [1058]:
df = pd.read_csv("data.csv")
df.drop(['poster_url', '2015_inflation'], axis = 1, inplace = True)
df.head()

Unnamed: 0,audience_freshness,rt_audience_score,rt_freshness,rt_score,adjusted,genres,Genre_1,Genre_2,Genre_3,imdb_rating,length,rank_in_year,rating,release_date,studio,title,worldwide_gross,year
0,92.0,4.3,89.0,7.5,"$712,903,691.09",Sci-Fi\nAdventure\nAction,Sci-Fi,Adventure,Action,7.8,136.0,7.0,PG-13,4-Apr-14,Marvel Studios,Captain America: The Winter Soldier,"$714,766,572.00",2014.0
1,89.0,4.2,90.0,7.9,"$706,988,165.89",Sci-Fi\nDrama\nAction,Sci-Fi,Drama,Action,7.7,130.0,9.0,PG-13,11-Jul-14,20th Century Fox,Dawn of the Planet of the Apes,"$708,835,589.00",2014.0
2,93.0,4.4,91.0,7.7,"$772,158,880.00",Sci-Fi\nAdventure\nAction,Sci-Fi,Adventure,Action,8.1,121.0,3.0,PG-13,1-Aug-14,Marvel Studios,Guardians of the Galaxy,"$774,176,600.00",2014.0
3,86.0,4.2,72.0,7.0,"$671,220,455.10",Sci-Fi\nAdventure,Sci-Fi,Adventure,,8.7,169.0,10.0,PG-13,7-Nov-14,Paramount Pictures / Warner Bros.,Interstellar,"$672,974,414.00",2014.0
4,71.0,3.8,49.0,5.7,"$756,677,675.77",Family\nAdventure\nAction,Family,Adventure,Action,7.1,97.0,4.0,PG,30-May-14,Walt Disney Pictures,Maleficent,"$758,654,942.00",2014.0


First, we need to clean the data: different spellings of the same studio name are merged into a single name.

In [1059]:
studio_to_rep = {
    "Warner Bros.": "Warner Bros",
    "Paramount Pictures": "Paramount Pictures", 
    "Universal Pictures": "Universal Pictures",
    "20th Century Fox": "20th Century Fox",
    "Columbia Pictures": "Columbia Pictures",
    "Walt Disney Pictures": "Walt Disney Pictures", 
    "Paramount": "Paramount", 
    "Disney": "Disney",
    "TriStar Pictures": "TriStar Pictures",
    "Columbia": "Columbia",
    "United Artists": "United Artists",
    "Universal": "Universal Pictures",
    "Fox": "Fox",
    "DreamWorks": "DreamWorks",
    "Touchstone Pictures": "Touchstone Pictures",
    "New Line Cinema": "New Line Cinema",
    "Universal Studios": "Universal Pictures",
    "Orion Pictures": "Orion Pictures",
    "Touchstone": "Touchstone",
    "Warner Bros. Pictures": "Warner Bros",
    "Marvel Studios": "Marvel Studios",
    "MGM": "MGM",
    "United Artists Pictures": "United Artists Pictures",
    "Metro-Goldwyn-Mayer": "Metro-Goldwyn-Mayer",
    "DreamWorks Picture": "DreamWorks Picture",
    "Warner Bros": "Warner Bros",
    "Lionsgate": "Lionsgate",
    "20th Century Fox Film Corporation": "20th Century Fox",
    "Hollywood Pictures": "Hollywood Pictures",
    "Lucasfilm": "Lucasfilm",
    "Summit": "Summit",
    "American International Pictures": "American International Pictures",
    "IFC Films": "IFC Films",
    "Associated Film Distribution": "Associated Film Distribution",
    "Lionsgate Films": "Lionsgate Films",
    "Sunn Classic Pictures": "Sunn Classic Pictures",
    "20th Century-Fox Film Corporation": "20th Century Fox",
    "National Air and Space Museum": "National Air and Space Museum",
    "United Film Distribution Company": "United Film Distribution Company",
    "Buena Vista Pictures Distribution": "Buena Vista Pictures Distribution",
    "20th Century-Fox": "20th Century Fox",
    "Embassy Pictures": "Embassy Pictures",
    "New Line": "New Line",
    "Gramercy Pictures": "Gramercy Pictures",
    "Fox Searchlight Pictures": "Fox Searchlight Pictures",
    "Miramax Films": "Miramax Films",                         
    "Summit Entertainment": "Summit Entertainment",                  
    "Icon": "Icon",                                  
    "Walt Disney Productions": "Walt Disney Pictures"
}


In [1060]:
df['studio'] = df['studio'].map(studio_to_rep)
df['studio'] = df['studio'].str.split("/", n=5, expand=True)[0].str.strip()
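
One thing to keep in mind with the cell above: `Series.map` with a dictionary returns `NaN` for every value that is not a key, so a co-production such as "Paramount Pictures / Warner Bros." is lost before the split on "/" can run. The following is a minimal sketch of an alternative order (split first, then replace aliases), shown on a small made-up sample; the `raw` frame and the shortened `aliases` dictionary are assumptions for illustration only, not the notebook's data.

```python
import pandas as pd

# Hypothetical sample of raw studio names, including a co-production.
raw = pd.DataFrame({"studio": [
    "Warner Bros.",
    "Universal Studios",
    "Paramount Pictures / Warner Bros.",
    "Marvel Studios",
]})

# A small subset of the alias dictionary; only spelling variants need listing.
aliases = {
    "Warner Bros.": "Warner Bros",
    "Universal Studios": "Universal Pictures",
}

# 1. Keep only the first studio of a co-production and trim whitespace.
first_studio = raw["studio"].str.split("/").str[0].str.strip()

# 2. Replace known aliases; names absent from the dict are left unchanged.
raw["studio_clean"] = first_studio.replace(aliases)

print(raw)
```

With `replace`, identity entries such as `"Lucasfilm": "Lucasfilm"` are not needed, which keeps the dictionary limited to the names that actually change.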

### Correlation between ratings from different sources
The dataset contains review scores from 3 sources. I want to show on a line chart whether these sources are correlated and whether all of them can be considered relevant. <br>
I removed the X-axis labels because the movie titles are not needed to compare the scores and would only clutter the chart.

In [1061]:
rates = df[['rt_audience_score', 'rt_score', 'imdb_rating']]
rates.head()

Unnamed: 0,rt_audience_score,rt_score,imdb_rating
0,4.3,7.5,7.8
1,4.2,7.9,7.7
2,4.4,7.7,8.1
3,4.2,7.0,8.7
4,3.8,5.7,7.1


In [1062]:
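# Rename the score columns so the legend shows readable labels.
# Start from the full df because the chart also uses 'title' and 'studio'.
df1 = df.rename(columns = {'rt_score': 'RT score', 
                           'rt_audience_score': 'Audience score', 
                           'imdb_rating': 'IMDb rating'})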
 
domain = ['RT score', 'Audience score', 'IMDb rating']
range_ = ['#fd8d3c', '#74c476', '#3182bd']
    
alt.Chart(df1).mark_line().transform_fold(
    fold=['Audience score', 'RT score', 'IMDb rating'], 
    as_=['variable', 'value']
).encode(
    x=alt.X('title', 
            axis=alt.Axis(labels=False), 
            title=None),
    y=alt.Y('max(value):Q', 
            title = 'Scores'),
    color=alt.Color('variable:N', 
                    legend=alt.Legend(title="Score type", 
                                      titleFontSize=14), 
                    scale=alt.Scale(domain=domain, 
                                    range=range_)),
    tooltip = ['studio', 'title']
).properties(width = 950, 
             height = 500
).transform_filter(alt.FieldGTEPredicate(field = 'RT score', gte = 0.1)
).configure_area(interpolate = 'step')


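Overall, the score trend is similar across all 3 sources; however, RT score and IMDb rating are on average higher than the audience score. Judging by the sample rows above, the audience score appears to be on a 5-point scale while the other two use 10 points, which accounts for at least part of that gap.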
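
The visual impression can be complemented with a numeric check. The sketch below computes pairwise Pearson correlations with `DataFrame.corr`, using only the five rows from the `rates.head()` output above, so the numbers it prints are illustrative and say nothing reliable about the full dataset; the `sample` frame is an assumption made for this example.

```python
import pandas as pd

# The five sample rows shown in the rates.head() output (not the full dataset).
sample = pd.DataFrame({
    "rt_audience_score": [4.3, 4.2, 4.4, 4.2, 3.8],
    "rt_score": [7.5, 7.9, 7.7, 7.0, 5.7],
    "imdb_rating": [7.8, 7.7, 8.1, 8.7, 7.1],
})

# Pairwise Pearson correlation between the three rating columns.
corr = sample.corr(method="pearson")
print(corr.round(2))
```

On the real data the same call would be `rates.corr()`. Correlation does not depend on the scale of each score, so the 5-point versus 10-point difference does not distort it.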

### Number of movies released by different studios through years (1975-2014)
I will visualize the number of movies released by different studios each year using a stacked bar chart. The taller the segment of a bar that belongs to a studio, the more movies that studio released. The number of movies is shown on the Y axis. <br>
**REMARK**: at first I showed all the data on a single chart; however, that visualization is not convenient for the viewer because the legend contains too many studios and their colors overlap. <br>
So I decided to split the chart into 4 smaller charts and show only 10 of the 40 studios on each of them.
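
The split into groups of ten can be sketched with a small helper. This is only an illustration: `chunk` and `studio_names` are hypothetical names that do not appear in the notebook, which slices the studio list by hand instead.

```python
def chunk(items, size):
    """Split a list into consecutive slices of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Hypothetical stand-in for the list of 40 unique studio names.
studio_names = [f"Studio {i}" for i in range(40)]

groups = chunk(studio_names, 10)
print(len(groups))    # number of charts needed
print(groups[0][:3])  # first few studios of the first chart
```

Each element of `groups` would then feed one chart, so the number of charts follows the number of studios rather than being fixed at four.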

In [1063]:
# Count the movies each studio released per year.
studios = (df.groupby(['year', 'studio'])
             .size()
             .reset_index(name='counter'))

studios.head()

Unnamed: 0,year,studio,counter
0,1975.0,Columbia Pictures,3
1,1975.0,United Artists,1
2,1975.0,Universal Pictures,1
3,1975.0,Walt Disney Pictures,1
4,1975.0,Warner Bros,1


In [1064]:
st = df['studio']
st_list = st.unique().tolist()

In [1065]:
# The hover selection is defined here so that this cell runs on its own.
select_mov = alt.selection_single(on = 'mouseover', nearest = False, empty = 'all')

alt.Chart(studios).mark_bar(size=20).encode(
    x = alt.X('year:Q', 
              scale=alt.Scale(domain=[1975, 2014]), 
              axis=alt.Axis(formatType='number'), 
              title = 'Year'),
    y = alt.Y('counter', 
              aggregate='sum', 
              title = 'Number of released movies by all studios'),
    color = alt.Color('studio', scale=alt.Scale(scheme='tableau20'), 
                      legend=alt.Legend(title='Studio', 
                                        titleFontSize=14)),
    tooltip = ['studio', 'counter'],
    opacity = alt.condition(select_mov, alt.value(1), alt.value(0.6))
).add_selection(select_mov
).properties(width = 950, 
             height = 500, 
             background = '#F9F9F9', 
             padding = 25, 
             title = "Number of movies released by different studios through years (1975-2014)")


### Improved version of visualisation

In [1066]:
select_mov = alt.selection_single(on = 'mouseover', nearest = False, empty = 'all')

In [1067]:
def studio_bar(studio_names, bar_size, year_domain):
    """Stacked bar chart of yearly movie counts for the given studios."""
    return alt.Chart(studios[studios['studio'].isin(studio_names)]).mark_bar(size=bar_size).encode(
        x = alt.X('year:Q', 
                  scale=alt.Scale(domain=year_domain), 
                  axis=alt.Axis(formatType='number'), 
                  title = 'Year'),
        y = alt.Y('counter', 
                  aggregate='sum', 
                  title = 'Number of released movies by all studios'),
        color = alt.Color('studio', scale=alt.Scale(scheme='tableau20'), 
                          legend=alt.Legend(title='Studio', 
                                            titleFontSize=14)),
        tooltip = ['studio', 'counter'],
        opacity = alt.condition(select_mov, alt.value(1), alt.value(0.6)))

# Ten studios per chart; bar width and year range are tuned for each group.
bar1 = studio_bar(st_list[:10],   8.5,  [1974, 2018])
bar2 = studio_bar(st_list[10:20], 13,   [1985, 2013])
bar3 = studio_bar(st_list[20:30], 11.5, [1974, 2003])
bar4 = studio_bar(st_list[30:],   34,   [1974, 1983])

alt.vconcat(bar1, bar2, bar3, bar4).add_selection(select_mov
                                  ).configure_view(stroke=None
                                  ).resolve_scale(color='independent')
                                  

### Dependency between movie Duration, Profit and Rank in Year
With a bubble chart I will show the dependency between a movie's duration, its profit and its rank in the year. Since all scores are correlated, as shown on the chart above, I will use only the IMDb rating. The size of each bubble reflects the rank in the year: the bigger the bubble, the higher the rank. <br>
It is interesting to investigate which movies earned the highest profit and whether they also have high ratings. I also want to check my assumption that movies shorter than 2 hours are not rated highly by viewers.
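
Besides the chart, the assumption about short movies can be checked with a simple group comparison. The sketch below uses only the five rows from the `df.head()` output shown earlier, so its result is illustrative rather than a finding; the `sample` frame and the `under_2h` column are names introduced for this example.

```python
import pandas as pd

# Sample rows taken from the df.head() output earlier in the notebook.
sample = pd.DataFrame({
    "title": [
        "Captain America: The Winter Soldier",
        "Dawn of the Planet of the Apes",
        "Guardians of the Galaxy",
        "Interstellar",
        "Maleficent",
    ],
    "length": [136.0, 130.0, 121.0, 169.0, 97.0],
    "imdb_rating": [7.8, 7.7, 8.1, 8.7, 7.1],
})

# Flag movies shorter than two hours (120 minutes).
sample["under_2h"] = sample["length"] < 120

# Mean IMDb rating and number of movies in each group.
summary = sample.groupby("under_2h")["imdb_rating"].agg(["mean", "count"])
print(summary)
```

Applied to the full `df`, the same two lines would show whether the gap between the groups holds beyond these few rows.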

In [1068]:
df['imdb_rating'] = df['imdb_rating'].round()

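def f(x):
    # Missing values are read as float NaN; return them (and any other float) unchanged.
    if isinstance(x, float):
        return x
    # Strip the dollar sign and thousands separators before converting to float.
    return float(x.replace('$', '').replace(',', ''))
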
df['worldwide_gross'] = df['worldwide_gross'].apply(f)
df['adjusted'] = df['adjusted'].apply(f)
df.head()

Unnamed: 0,audience_freshness,rt_audience_score,rt_freshness,rt_score,adjusted,genres,Genre_1,Genre_2,Genre_3,imdb_rating,length,rank_in_year,rating,release_date,studio,title,worldwide_gross,year
0,92.0,4.3,89.0,7.5,712903700.0,Sci-Fi\nAdventure\nAction,Sci-Fi,Adventure,Action,8.0,136.0,7.0,PG-13,4-Apr-14,Marvel Studios,Captain America: The Winter Soldier,714766572.0,2014.0
1,89.0,4.2,90.0,7.9,706988200.0,Sci-Fi\nDrama\nAction,Sci-Fi,Drama,Action,8.0,130.0,9.0,PG-13,11-Jul-14,20th Century Fox,Dawn of the Planet of the Apes,708835589.0,2014.0
2,93.0,4.4,91.0,7.7,772158900.0,Sci-Fi\nAdventure\nAction,Sci-Fi,Adventure,Action,8.0,121.0,3.0,PG-13,1-Aug-14,Marvel Studios,Guardians of the Galaxy,774176600.0,2014.0
3,86.0,4.2,72.0,7.0,671220500.0,Sci-Fi\nAdventure,Sci-Fi,Adventure,,9.0,169.0,10.0,PG-13,7-Nov-14,,Interstellar,672974414.0,2014.0
4,71.0,3.8,49.0,5.7,756677700.0,Family\nAdventure\nAction,Family,Adventure,Action,7.0,97.0,4.0,PG,30-May-14,Walt Disney Pictures,Maleficent,758654942.0,2014.0


In [1069]:
alt.Chart(df).mark_point(filled = True).encode(
    x = alt.X('length', 
              title = 'Movie duration (minutes)', 
              scale = alt.Scale(domain = [70, 200])),
    y = alt.Y('worldwide_gross', 
              title = 'Profit ($)'),
    color = alt.Color('imdb_rating', 
                      legend=alt.Legend(title='IMDb Rating', titleFontSize=14), 
                      scale=alt.Scale(scheme='cividis')),
    size = alt.Size('rank_in_year:N', 
                    scale = alt.Scale(range = [100, 1200], reverse=True), 
                    legend=alt.Legend(title='Rank in year', 
                                      titleFontSize=14)),
    tooltip = ['studio', 'title', 'imdb_rating', 'rank_in_year'],
    opacity = alt.condition(select_mov, alt.value(0.8), alt.value(0.6))
).add_selection(
    select_mov
).properties(
    width = 950, 
    height = 500, 
    background = '#F9F9F9', 
    padding = 25, 
    title = "Dependency between movie duration, profit and rank in year"
).transform_filter(
    alt.FieldGTEPredicate(field = 'length', gte = 60))


From this visualization we can draw several conclusions: <br>
1. IMDb rating is correlated with the rank of a movie in the year. The general picture looks like this: the bigger the bubble (top ranks), the lighter it is (high scores), and vice versa. <br>
2. There are a few profit trends: 
    - Movies between 85 and 145 minutes long earn on average 200 to 400 million dollars. 
    - However, from 80 to 100 minutes and from 95 to 160 minutes there are clear lines going up, which show that the longer the movie, the higher the profit.
    - Movies between 80 and 105 minutes long have low ratings, sit at the bottom of the yearly rank and earn the smallest profit. <br>
3. There are not many movies longer than 170 minutes, but almost all of them have high ratings. <br>
4. All movies with the highest profit (above 1.2 billion dollars) are ranked first in their year and have nearly the highest IMDb ratings.
    

### Data investigation
It was interesting to find out which movie genres are present in the dataset, since each movie is listed under up to 3 different genres.

In [1070]:
genres1 = df['Genre_1']
genres2 = df['Genre_2']
genres3 = df['Genre_3']

g_list = genres1.unique().tolist()
g_list.extend(genres2.unique().tolist())
g_list.extend(genres3.unique().tolist())

g_list = set(g_list)
g_list

{'Action',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Sport',
 'Thriller',
 'War',
 'Western',
 nan}
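
The set above shows which genres occur, including `nan` for movies with fewer than three genres. A possible next step is to count how often each genre appears across the three columns. The sketch below does this with `melt` and `value_counts` on the genre columns of the five sample rows shown earlier; the `sample` frame is an assumption for illustration, and on the real data `df` would take its place.

```python
import pandas as pd

# Genre columns of the five sample rows shown earlier (None marks a missing genre).
sample = pd.DataFrame({
    "Genre_1": ["Sci-Fi", "Sci-Fi", "Sci-Fi", "Sci-Fi", "Family"],
    "Genre_2": ["Adventure", "Drama", "Adventure", "Adventure", "Adventure"],
    "Genre_3": ["Action", "Action", "Action", None, "Action"],
})

# Stack the three columns into one Series, drop missing values, and count.
genre_counts = (
    sample[["Genre_1", "Genre_2", "Genre_3"]]
    .melt(value_name="genre")["genre"]
    .dropna()
    .value_counts()
)
print(genre_counts)
```

Dropping missing values before counting keeps `nan` out of the result, unlike the plain `set` of unique values.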