## 1. Welcome!
<p><img src="https://assets.datacamp.com/production/project_1170/img/office_cast.jpeg" alt="The Office cast"></p>
<p><strong>The Office!</strong> What started as a British mockumentary series about office culture in 2001 has since spawned ten other variants across the world, including an Israeli version (2010-13), a Hindi version (2019-), and even a French Canadian variant (2006-2007). Of all these iterations (including the original), the American series has been the longest-running, spanning 201 episodes over nine seasons.</p>
<p>In this notebook, we will take a look at a dataset of The Office episodes, and try to understand how the popularity and quality of the series varied over time. To do so, we will use the following dataset: <code>datasets/office_episodes.csv</code>, which was downloaded from Kaggle <a href="https://www.kaggle.com/nehaprabhavalkar/the-office-dataset">here</a>.</p>
<p>This dataset contains information on a variety of characteristics of each episode. In detail, these are:
<br></p>
<div style="background-color: #efebe4; color: #05192d; text-align:left; vertical-align: middle; padding: 15px 25px 15px 25px; line-height: 1.6;">
    <div style="font-size:20px"><b>datasets/office_episodes.csv</b></div>
<ul>
    <li><b>episode_number:</b> Canonical episode number.</li>
    <li><b>season:</b> Season in which the episode appeared.</li>
    <li><b>episode_title:</b> Title of the episode.</li>
    <li><b>description:</b> Description of the episode.</li>
    <li><b>ratings:</b> Average IMDb rating.</li>
    <li><b>votes:</b> Number of votes.</li>
    <li><b>viewership_mil:</b> Number of US viewers in millions.</li>
    <li><b>duration:</b> Duration in number of minutes.</li>
    <li><b>release_date:</b> Airdate.</li>
    <li><b>guest_stars:</b> Guest stars in the episode (if any).</li>
    <li><b>director:</b> Director of the episode.</li>
    <li><b>writers:</b> Writers of the episode.</li>
    <li><b>has_guests:</b> True/False column for whether the episode contained guest stars.</li>
    <li><b>scaled_ratings:</b> The ratings scaled from 0 (worst-reviewed) to 1 (best-reviewed).</li>
</ul>
    </div>

In [None]:
# Importing the libraries
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize']=[11,7]

# Importing the dataset
tv_data=pd.read_csv('../input/the-office/office_episodes.csv', parse_dates=['release_date'])

# Display the first five rows
display(tv_data.head())


## **Analyzing the Dataset**
We parsed <code>release_date</code> as a date when loading the file. The <code>guest_stars</code> column contains many null values, but this does not mean the data is incomplete: not every episode has a guest star.
The dataset also provides a <code>scaled_ratings</code> column, which is the rating rescaled from 0 to 1 for this analysis. Next, we summarize and describe the data to check whether every column has an appropriate data type or whether the data needs cleaning. After that, all that is left is to build the visualizations and draw conclusions.
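
The claim that nulls in <code>guest_stars</code> simply mean "no guest" can be checked against <code>has_guests</code>. The sketch below uses a tiny made-up DataFrame (the rows and guest names are hypothetical, only the column names come from the dataset); in the notebook the same two expressions could be applied to <code>tv_data</code>.

```python
import pandas as pd

# Made-up sample rows for illustration; only the column names match the dataset.
sample = pd.DataFrame({
    "episode_title": ["A", "B", "C", "D"],
    "guest_stars": [None, "Guest One", None, "Guest Two, Guest Three"],
    "has_guests": [False, True, False, True],
})

# Number of episodes with no guest_stars entry.
n_missing = int(sample["guest_stars"].isna().sum())

# True if guest_stars is null exactly where has_guests is False.
nulls_match_flag = bool((sample["guest_stars"].isna() == (~sample["has_guests"])).all())

print(n_missing, nulls_match_flag)
```

If the second value comes out as <code>False</code> on the real data, some nulls would need a closer look before treating them as "no guest".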

In [None]:
#Summary of the loaded dataset
tv_data.info()

In [None]:
# Describing the dataset using summary statistics
tv_data.describe()

## **Visualizing The Data**
The output of <code>.info()</code> and <code>.describe()</code> shows that the data is clean and ready for analysis: no column needs its data type changed, and no other changes are required. After writing some helper code to make the visualization more appealing and easier to read, we can create the plot and complete the analysis.
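
The data type check can also be made explicit rather than read off the <code>.info()</code> output. This is a minimal sketch on a two-row, made-up CSV held in memory (the values are hypothetical); it assumes the same <code>parse_dates</code> argument used when loading the real file.

```python
import io
import pandas as pd

# Two made-up rows standing in for office_episodes.csv.
csv_text = """episode_number,ratings,release_date,has_guests
0,7.0,2020-01-01,False
1,8.0,2020-01-08,True
"""

sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["release_date"])

# Check that each column was read with the type the analysis relies on.
date_ok = pd.api.types.is_datetime64_any_dtype(sample["release_date"])
bool_ok = pd.api.types.is_bool_dtype(sample["has_guests"])
numeric_ok = pd.api.types.is_numeric_dtype(sample["ratings"])

print(date_ok, bool_ok, numeric_ok)
```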

In [None]:
# Creating a list of colors, one per episode, based on the scaled rating
cols=[]

for rating in tv_data['scaled_ratings']:
    if rating<.25:
        cols.append('red')
    elif rating<.5:
        cols.append('orange')
    elif rating<.75:
        cols.append('lightgreen')
    else:
        cols.append('darkgreen')

cols

In [None]:
# Creating a list of marker sizes so that episodes with guests are drawn larger
size=[]

for has_guest in tv_data['has_guests']:
    if has_guest:
        size.append(250)
    else:
        size.append(25)
size

In [None]:
#Adding new columns to the DataFrame which will help to create better visualisations
tv_data['colors']=cols
tv_data['size']=size

In [None]:
# Creating two DataFrames, one with guests appearances and one without guests appearances
tv_data_has_guests=tv_data[tv_data['has_guests']==True]
tv_data_no_guests=tv_data[tv_data['has_guests']==False]

In [None]:
# Visualising the data
# Apply the style before creating the figure so the figure itself picks it up
plt.style.use('fivethirtyeight')
fig=plt.figure()

plot1=plt.scatter(data=tv_data_no_guests,
                  x="episode_number",
                  y="viewership_mil",
                  c='colors',
                  s='size')
plot2=plt.scatter(data=tv_data_has_guests,
                  x="episode_number",
                  y="viewership_mil",
                  c='colors',
                  s='size',
                  marker='*')
plt.title("Popularity, Quality, and Guest Appearances on the Office")
plt.xlabel("Episode Number")
plt.ylabel("Viewership (Millions)")
plt.show()

## Understanding the Visualization 
The scatter plot colors each episode by its scaled rating:  
i) Red: scaled rating below 0.25  
ii) Orange: scaled rating of at least 0.25 and below 0.5  
iii) Light green: scaled rating of at least 0.5 and below 0.75  
iv) Dark green: scaled rating of 0.75 or above  
<p>Additionally, episodes with guest appearances are drawn larger and marked with a star.</p>
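
The color bands and marker sizes described above can also be derived without looping over rows. The following is a sketch of one alternative, not the approach used in this notebook: it uses <code>pd.cut</code> with left-closed bins and <code>np.where</code>, on a made-up five-row sample (the values are hypothetical).

```python
import numpy as np
import pandas as pd

# Made-up sample; in the notebook these columns live in tv_data.
sample = pd.DataFrame({
    "scaled_ratings": [0.1, 0.25, 0.6, 0.75, 1.0],
    "has_guests": [False, True, False, True, True],
})

# right=False makes each bin include its lower edge, matching the bands:
# [-inf, 0.25), [0.25, 0.5), [0.5, 0.75), [0.75, inf)
colors = pd.cut(
    sample["scaled_ratings"],
    bins=[-np.inf, 0.25, 0.5, 0.75, np.inf],
    labels=["red", "orange", "lightgreen", "darkgreen"],
    right=False,
).astype(str).tolist()

# Larger markers for episodes with guests.
sizes = np.where(sample["has_guests"], 250, 25).tolist()

print(colors)
print(sizes)
```
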
To close the analysis, we now obtain the guest stars who appeared in the episode with the highest viewership.
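
As a sketch of this last step, the most-watched row can be selected with <code>idxmax</code> and its guest list split into individual names. The sample rows and names below are made up, and the comma-separated format of <code>guest_stars</code> is an assumption about how the column is stored.

```python
import pandas as pd

# Made-up rows for illustration; only the column names match the dataset.
sample = pd.DataFrame({
    "episode_title": ["Episode A", "Episode B", "Episode C"],
    "viewership_mil": [7.5, 12.0, 9.1],
    "guest_stars": [None, "Guest One, Guest Two", "Guest Three"],
})

# idxmax returns the index label of the first row with the highest viewership.
top_row = sample.loc[sample["viewership_mil"].idxmax()]

# Assumption: guest names are stored as one comma-separated string.
top_guests = [name.strip() for name in top_row["guest_stars"].split(",")]

print(top_row["episode_title"], top_guests)
```

Note that <code>idxmax</code> keeps only the first row when several episodes tie for the maximum, whereas a boolean filter on the maximum value keeps all of them.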



In [None]:
# Obtaining a filtered DataFrame which shows the episode with the highest viewership
tv_data_most_watched=tv_data[tv_data['viewership_mil']==tv_data['viewership_mil'].max()]

In [None]:
top_stars=tv_data_most_watched['guest_stars']
top_stars

## **Conclusion**
The chart shows that most episodes with guest stars received a good rating, and some received a very high one. Still, quite a few episodes earned only a middling rating even with guest stars appearing in them. One observation stands out: the episode with a viewership of more than 22.5 million. It might look like an outlier caused by a discrepancy in the data, but it is in fact accurate.

Finally, we extracted the guest stars who appeared in the most-watched episode, and that concludes our analysis.

