In [1]:
import zipfile
import pandas as pd
from kaggle.api.kaggle_api_extended import KaggleApi



### Kaggle: The Office (US) - Complete Dialogue/Transcript

We download the data from an existing Kaggle dataset and load it into a DataFrame for exploratory data analysis:

In [2]:
api = KaggleApi()
api.authenticate()

api.dataset_download_file('nasirkhalid24/the-office-us-complete-dialoguetranscript','The-Office-Lines-V4.csv', path='./data/kaggle_data')
with zipfile.ZipFile('./data/kaggle_data/The-Office-Lines-V4.csv.zip', 'r') as zip_ref:
    zip_ref.extractall('./data')
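
If this cell is re-run often, it may be convenient to skip files that were already extracted. The following is only a minimal sketch using the standard library; `extract_if_missing` is a hypothetical helper name, not part of the Kaggle API, and it assumes the archive contains plain files.

```python
import zipfile
from pathlib import Path


def extract_if_missing(zip_path, target_dir):
    """Extract the archive into target_dir, skipping members that already exist.

    Returns the list of member names that were actually extracted.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        for name in zip_ref.namelist():
            # Only write the member if no file with that name is present yet
            if not (target / name).exists():
                zip_ref.extract(name, target)
                extracted.append(name)
    return extracted
```

Under these assumptions, `extract_if_missing('./data/kaggle_data/The-Office-Lines-V4.csv.zip', './data')` could stand in for the `with zipfile.ZipFile(...)` block above.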



In [3]:
transcripts_df = pd.read_csv('data/The-Office-Lines-V4.csv')
transcripts_df = transcripts_df.drop('Unnamed: 6', axis=1)

transcripts_df.head()

Unnamed: 0,season,episode,title,scene,speaker,line
0,1,1,Pilot,1,Michael,All right Jim. Your quarterlies look very good...
1,1,1,Pilot,1,Jim,"Oh, I told you. I couldn't close it. So..."
2,1,1,Pilot,1,Michael,So you've come to the master for guidance? Is ...
3,1,1,Pilot,1,Jim,"Actually, you called me in here, but yeah."
4,1,1,Pilot,1,Michael,"All right. Well, let me show you how it's done."


Looking at the data, we can see that the `speaker` column contains only each character's first name, not the surname. Since the Dunderpedia data includes the characters' surnames, we will modify the `speaker` column of `transcripts_df` to include them as well. This will be useful later in the analysis.
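
One possible way to attach surnames is a dictionary lookup on the first name. The sketch below is an illustration only: `full_names` is a hypothetical, hand-written mapping (the real one would be derived from the Dunderpedia data, which is not shown here), and `demo_df` is a toy stand-in for `transcripts_df`.

```python
import pandas as pd

# Hypothetical first-name -> full-name mapping, written by hand for illustration
full_names = {"Michael": "Michael Scott", "Jim": "Jim Halpert"}

# Toy stand-in for transcripts_df
demo_df = pd.DataFrame({"speaker": ["Michael", "Jim", "Todd"]})

# map() yields NaN for speakers missing from the mapping,
# so fall back to the original value in those rows
demo_df["speaker_full"] = demo_df["speaker"].map(full_names).fillna(demo_df["speaker"])
```

Speakers without an entry in the mapping keep their original value, so no row is lost.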

In [4]:
transcripts_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54626 entries, 0 to 54625
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   season   54626 non-null  int64 
 1   episode  54626 non-null  int64 
 2   title    54626 non-null  object
 3   scene    54626 non-null  int64 
 4   speaker  54626 non-null  object
 5   line     54626 non-null  object
dtypes: int64(3), object(3)
memory usage: 2.5+ MB


The downloaded data consists of 54626 rows and 6 attributes: the season, the episode number, the episode title, the scene number, the speaker (the character saying the line) and the line of dialogue itself.

Next we make sure that our data does not contain missing values:

In [5]:
transcripts_df.isnull().sum()

season     0
episode    0
title      0
scene      0
speaker    0
line       0
dtype: int64
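
Note that `isnull()` only detects missing values, not strings that are empty or contain only whitespace. A complementary check could look like the sketch below, shown on a toy `demo_df` rather than on the real data; the column list `text_cols` is an assumption about which columns hold free text.

```python
import pandas as pd

# Toy stand-in for transcripts_df with one blank speaker and one empty line
demo_df = pd.DataFrame({
    "speaker": ["Jim", " ", "Pam"],
    "line": ["Hey.", "Hi.", ""],
})

text_cols = ["speaker", "line"]  # assumed free-text columns

# Count values that are empty once surrounding whitespace is removed
blank_counts = demo_df[text_cols].apply(lambda col: col.str.strip().eq("").sum())
```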

Looking at the data, we can see that some lines are spoken by several speakers at once, in formats such as `Jim & Pam`, `Dwight and Andy` or `Michael, Pam & Ryan`. For a more accurate analysis, we will split these into separate rows, one per speaker.

In [6]:
transcripts_df.loc[transcripts_df['speaker'] == 'Michael, Pam & Ryan']

Unnamed: 0,season,episode,title,scene,speaker,line
24660,5,24,Heavy Competition,3908,"Michael, Pam & Ryan","U-G-L-Y, you ain't got no alibi! You ugly, huh..."


In [7]:
transcripts_df.loc[transcripts_df['speaker'] == 'Jim & Pam']

Unnamed: 0,season,episode,title,scene,speaker,line
29117,6,11,Shareholder Meeting,4477,Jim & Pam,No thanks.


In [8]:
transcripts_df.loc[transcripts_df['speaker'] == 'Dwight and Andy']

Unnamed: 0,season,episode,title,scene,speaker,line
24571,5,23,The Michael Scott Paper Company,3894,Dwight and Andy,"Take me home, country roads, to the place, I b..."
27785,6,6,Mafia,4291,Dwight and Andy,Oh man.
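
Rather than looking up each combination by its exact value, all multi-speaker rows could be located at once with a regular expression. This is a sketch on a toy `demo_df`; the pattern `multi_pattern` is an assumption about which separators occur in the data.

```python
import pandas as pd

# Toy stand-in for transcripts_df
demo_df = pd.DataFrame({
    "speaker": ["Jim & Pam", "Dwight and Andy", "Michael, Pam & Ryan", "Andy", "Michael"],
})

# A comma, an ampersand, or the standalone word "and" (case-sensitive)
multi_pattern = r",|&|\band\b"

multi_mask = demo_df["speaker"].str.contains(multi_pattern, regex=True)
multi_speakers = demo_df.loc[multi_mask, "speaker"].tolist()
```

The word boundaries `\b` keep single names that merely contain the letters "and" out of the result.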


In [9]:
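# Normalise the separators "and" and "&" to commas.
# Note: the pattern 'and' is not anchored, so it also matches inside longer words.
transcripts_df['speaker'] = transcripts_df['speaker'].replace('and', ',', regex=True)
transcripts_df['speaker'] = transcripts_df['speaker'].replace('&', ',', regex=True)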

transcripts_df['speaker'] = transcripts_df['speaker'].str.split(',') # convert column to array

transcripts_df = transcripts_df.explode('speaker') # split array in multiple rows (one row per element)
transcripts_df['speaker'] = transcripts_df['speaker'].str.strip() # remove white spaces

transcripts_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 54815 entries, 0 to 54625
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   season   54815 non-null  int64 
 1   episode  54815 non-null  int64 
 2   title    54815 non-null  object
 3   scene    54815 non-null  int64 
 4   speaker  54815 non-null  object
 5   line     54815 non-null  object
dtypes: int64(3), object(3)
memory usage: 2.9+ MB


Now we have 189 more rows than before, and every multi-speaker line has been split into one row per speaker.
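
Because the replacement pattern `'and'` is matched anywhere in the string, a name that contains those letters would also be split. A stricter variant, sketched below on a toy `demo_df`, splits only on commas, ampersands and the standalone word "and". This is an alternative under stated assumptions, not the method used above, and applying it to the real data could change the row count reported here.

```python
import re

import pandas as pd

# Toy stand-in for transcripts_df; "Brandon" is a demo value that contains "and"
demo_df = pd.DataFrame({
    "speaker": ["Dwight and Andy", "Michael, Pam & Ryan", "Brandon"],
    "line": ["Oh man.", "U-G-L-Y", "Hello."],
})

# Separator: comma, ampersand, or the whole word "and", with surrounding whitespace
splitter = re.compile(r"\s*(?:,|&|\band\b)\s*")

# Turn each speaker string into a list of names, then emit one row per name
demo_df["speaker"] = demo_df["speaker"].map(splitter.split)
demo_df = demo_df.explode("speaker").reset_index(drop=True)
```

Resetting the index after `explode` avoids the duplicated index labels that the exploded rows would otherwise share.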

#### Save data to csv (to be used in the main notebook)

In [10]:
transcripts_df.to_csv("data/transcripts.csv", index=False)
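
As a quick sanity check, the saved file can be read back and compared with the DataFrame in memory. The sketch below does this on a toy `demo_df` with an in-memory buffer instead of a file on disk; the duplicated index labels mimic what `explode` leaves behind.

```python
import io

import pandas as pd

# Toy stand-in with duplicated index labels and text containing commas and quotes
demo_df = pd.DataFrame(
    {"speaker": ["Jim", "Pam"], "line": ["Oh, I told you.", 'She said "no thanks".']},
    index=[0, 0],
)

buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)  # index=False drops the duplicated labels
buffer.seek(0)

reloaded = pd.read_csv(buffer)
```

The reloaded frame gets a fresh default index, and the quoted text comes back unchanged.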