<img src="https://github.com/simonwiles/colab_workshops/raw/master/cidr-logo.no-text.240x140.png" style="display:block;margin:0 auto;" alt="Center for Interdisciplinary Digital Research @ Stanford"/>
<H1 align="center">Digital Tools and Methods for the Humanities and Social Sciences</H1>
<H2 align="center">Data Manipulation with Python</H2>

### Instructors
- Scott Bailey (CIDR), <em>scottbailey@stanford.edu</em>
- Simon Wiles (CIDR), <em>simon.wiles@stanford.edu</em>

### Goal
By the end of this workshop, we hope you'll be able to load data into a Pandas `DataFrame`, perform basic cleaning and analysis, and create visualizations of some relevant aspects of a dataset.  For most of this workshop we will work with a dataset prepared from the [IMDb Datasets](https://www.imdb.com/interfaces/) and the [OMDb API](https://www.omdbapi.com/).

### Topics
- Pandas Series and DataFrame
- Loading data in, null and missing data
- Describing data
- Column manipulation
- String manipulation
- Split-Apply-Combine
- Plotting:
  - Basic charts (line, bar, pie)
  - Histograms
  - Scatter plots
  - Boxplots, violinplots
  
<mark>TODO: the above is not really reflective!</mark>

### Jupyter Notebooks and Google Colaboratory

Jupyter notebooks are a way to write and run Python code interactively. They're quickly becoming a standard way of putting together data, code, and written explanations or visualizations into a single document and sharing it. There are a lot of ways that you can run Jupyter notebooks, including locally on your computer, but we've decided to use Google's Colaboratory notebook platform for this workshop. Colaboratory is a cloud-based platform that allows you to create, run, and share notebooks from your browser. Its notebooks come with a number of popular libraries pre-installed, and allow you to install other libraries as needed.

Using the Google Colaboratory platform allows us to focus on learning and writing Python in the workshop rather than on setting up Python, which can sometimes take a bit of extra work depending on platforms, operating systems, and other installed applications. If you'd like to install a Python distribution locally, though, we have some instructions (with gifs!) on installing Python through the Anaconda distribution, which will also help you handle virtual environments: https://github.com/sul-cidr/python_workshops/blob/master/setup.ipynb <mark> ← TODO: migrate this to a wiki page on the CIDR Workshops repo</mark>

If you run into problems, or would like to look into other ways of installing Python or handling virtual environments, feel free to send us an email (contact-cidr@stanford.edu) or visit us during our [consulting hours](https://library.stanford.edu/research/cidr/consulting).

~For now, go ahead to https://notebooks.azure.com and login with your Stanford ID and password.~

### Environment
If you would prefer to use Anaconda or your own local installation of Python or Jupyter Notebooks, you will need an environment with the following packages installed and available for this workshop:
- `pandas`
- `matplotlib`
- `requests`
- `sqlalchemy`
- `seaborn` (available in the `conda-forge` channel)

Please note that we will likely not have time during the workshop to support you with problems related to a local environment, and we do recommend using the Colaboratory notebooks if you are at all unsure.

###  Copying this notebook
~Go to https://notebooks.azure.com/versae/libraries/cidr-data-manipulation~
    
~From there, click "Clone" to create a full copy of this library.~

## 1. What is Pandas?

Pandas is a high-level data manipulation tool first created in 2008 by Wes McKinney.  The name is derived from the term “panel data,” an econometrics term for data sets that include observations over multiple time periods for the same individuals.<sup>[[wikipedia](https://en.wikipedia.org/wiki/Pandas_(software))]</sup>

From Jake VanderPlas’ book [**Python Data Science Handbook**](http://shop.oreilly.com/product/0636920034919.do) (from which some code excerpts are used in this workshop):

> Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a `DataFrame`. `DataFrame`s are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

### 1.1. What does Pandas *do*?

<mark>TODO: wip</mark>
* Reading and writing data from persistent storage
* Cleaning, filtering, and otherwise preparing data
* Calculating statistics and analyzing data


... but perhaps we should let Pandas introduce itself:

In [58]:
import pandas as pd

# The two lines below configure how our outputs are shown in this notebook
#  environment.  They need not concern us now.
#pd.set_option('display.max_rows', 20)
#pd.DataFrame._repr_html_ = lambda self: '<style>table.dataframe td {white-space: nowrap}</style>' + self.to_html(max_rows=10, notebook=True)

In [None]:
pd?

### 1.2. Where can I get more help with Pandas?

The [Pandas website](https://pandas.pydata.org/) and [online documentation](http://pandas.pydata.org/pandas-docs/stable/) are useful resources, and the indispensable [Stack Overflow](https://stackoverflow.com/questions/tagged/pandas) has a "pandas" tag.  There is also a (much younger, much smaller) sister [site dedicated to Data Science questions](https://datascience.stackexchange.com/questions/tagged/pandas) that has a "pandas" tag too.

In [None]:
pd.isnull?

## 2. Introduction to `DataFrame`s and `Series`

The main data structure that Pandas implements is the `DataFrame`, and a `DataFrame` is composed of one or more `Series` and, optionally, an `Index`.  

A `DataFrame` is a two-dimensional array with flexible row indices and flexible column names. It can be thought of as a generalization of a two-dimensional NumPy array, or a specialization of a dictionary in which each column name maps to a `Series` of column data.

A `Series` is a one-dimensional array of indexed data. It can be thought of as a specialized dictionary or a generalized NumPy array.

A `DataFrame` is made up of `Series` in much the same way that a table is made up of columns. The only restriction is that all of the values within a single column must share the same data type (different columns may have different types).  Many of the operations that can be performed on a `DataFrame` can also be performed on an individual `Series`.
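
The relationship between the two structures can be sketched in a few lines. This is only an illustration: the names `population` and `counties` are chosen for this sketch, and the figures are borrowed from the Bay Area data used later in the workshop.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels.
population = pd.Series(
    [1494876, 250666, 870887],
    index=["Alameda", "Marin", "San Francisco"],
    name="population",
)

# A DataFrame: several Series that share one index, one Series per column.
counties = pd.DataFrame({
    "population": population,
    "county seat": pd.Series(
        ["Oakland", "San Rafael", "San Francisco"],
        index=["Alameda", "Marin", "San Francisco"],
    ),
})

# Selecting a single column hands back a Series ...
print(type(counties["population"]))

# ... and each column carries its own data type.
print(counties.dtypes)
```

Because both `Series` were built with the same labels, Pandas lines the values up by label rather than by position when it assembles the `DataFrame`.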


<mark>**GRAPHIC HERE**</mark>

## 3. Creating `DataFrame`s and loading data

There are a great many ways to create a Pandas `DataFrame` -- we can build one ourselves in lower-level Python datatypes, of course, but Pandas also provides methods to load data in from common storage and serialization formats.

<a title="PerryPlanet [Public domain], via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Bayarea_map.svg" style="float:right"><img width="256" alt="Bayarea map" src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/78/Bayarea_map.svg/512px-Bayarea_map.svg.png"></a>
### 3.1. Introduction to `DataFrame`s

The simplest way to generate a `DataFrame` is to create it directly from a `dict` of `list`s:

In [51]:
data = {
    "county": ["Alameda", "Contra Costa", "Marin", "Napa", "San Francisco", "San Mateo", "Santa Clara", "Solano", "Sonoma"],
    "county seat": ["Oakland", "Martinez", "San Rafael", "Napa", "San Francisco", "Redwood City", "San Jose", "Fairfield", "Santa Rosa"],
    "population": [1494876, 1037817, 250666, 135377, 870887, 711622, 1762754, 411620, 478551],
    "area": [2130, 2080, 2140, 2040, 600.59, 1930, 3380, 2350, 4580]
}
bay_area_counties = pd.DataFrame(data)
bay_area_counties

Unnamed: 0,county,county seat,population,area
0,Alameda,Oakland,1494876,2130.0
1,Contra Costa,Martinez,1037817,2080.0
2,Marin,San Rafael,250666,2140.0
3,Napa,Napa,135377,2040.0
4,San Francisco,San Francisco,870887,600.59
5,San Mateo,Redwood City,711622,1930.0
6,Santa Clara,San Jose,1762754,3380.0
7,Solano,Fairfield,411620,2350.0
8,Sonoma,Santa Rosa,478551,4580.0


Pandas has automatically created an `Index` on this `DataFrame` ([0..8]), but we can also specify our own `Index` when we instantiate the frame ourselves:

In [None]:
bay_area_counties = pd.DataFrame(data, index=["Ala", "Con", "Mar", "Nap", "SF", "SM", "SC", "Sol", "Son"])
bay_area_counties

This allows us to `loc`ate a specific row using its key in the `Index`:

In [None]:
bay_area_counties.loc['Ala']

We can also set an `Index` at any time after the `DataFrame` has been created, either by adding a new index:

In [None]:
bay_area_counties = pd.DataFrame(data)
bay_area_counties.index = ["Ala", "Con", "Mar", "Nap", "SF", "SM", "SC", "Sol", "Son"]
bay_area_counties

or by choosing one of the existing columns to become the index: <mark>(note the use of `inplace=True`)</mark>

In [None]:
bay_area_counties = pd.DataFrame(data)
bay_area_counties.set_index('county', inplace=True)
bay_area_counties

In [None]:
bay_area_counties.loc['Santa Clara']
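
If modifying a frame in place feels uncomfortable, `set_index` can also be called without `inplace=True`, in which case it returns a new `DataFrame` and leaves the original alone. The sketch below uses a cut-down copy of the counties data; the names `counties`, `by_county`, and `restored` are chosen for illustration.

```python
import pandas as pd

counties = pd.DataFrame({
    "county": ["Marin", "Napa", "Solano"],
    "population": [250666, 135377, 411620],
})

# Without `inplace=True`, `set_index` leaves `counties` untouched and
# returns a new DataFrame, which we bind to a new name.
by_county = counties.set_index("county")

print(counties.index.tolist())   # still the default integer index
print(by_county.index.tolist())  # the county names

# `reset_index` moves the index back into an ordinary column.
restored = by_county.reset_index()
print(restored.columns.tolist())
```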

### 3.2. Reading data from persistent storage

However, most of the time we're more likely to be reading data in from an external source of some kind, and Pandas has us well covered here.

First, let's grab some data into our Colaboratory Notebook environment so that we can work with it locally:

In [None]:
!mkdir -p workshop_data
!wget https://raw.githubusercontent.com/simonwiles/colab_workshops/master/sample_data/imdb_top_1000.csv -O workshop_data/imdb_top_1000.csv

#### 3.2.1. CSV files
Reading in data from CSV files is as simple as:

In [None]:
data_frame = pd.read_csv('workshop_data/imdb_top_1000.csv')
data_frame

Notice again that Pandas has created a default `Index` for this `DataFrame` -- we probably want the `imdbID` column to be the `Index`, and we can set that after import, as above, or we can specify it when loading the CSV initially:

In [None]:
data_frame = pd.read_csv('workshop_data/imdb_top_1000.csv', index_col='imdbID')
data_frame.head(4) # the `head` method defaults to five if called without an argument
                   # a `tail` method is also available with the same semantics
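
To experiment with `read_csv` without downloading anything, the same call can be pointed at an in-memory buffer. This is a sketch rather than part of the workshop data: `csv_text` is a made-up three-row stand-in for the real file, and `films` is a name chosen for illustration.

```python
import io
import pandas as pd

# A tiny stand-in for the workshop CSV; `csv_text` is made up for this sketch.
csv_text = """imdbID,Title,Year,Rated
tt0111161,The Shawshank Redemption,1994,R
tt0068646,The Godfather,1972,R
tt5813916,The Mountain II,2016,
"""

# `read_csv` accepts file-like objects as well as paths and URLs.
films = pd.read_csv(io.StringIO(csv_text), index_col="imdbID")

print(films.index.name)
print(films.loc["tt0068646", "Title"])

# The empty `Rated` field on the last row is read in as a missing value.
print(films["Rated"].isna().sum())
```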

#### 3.2.2. Reading data from JSON Files

JSON files can be loaded in a similarly straightforward way.

There are two things to note here:

1. The nature of JSON as a file format is such that the `Index` is explicit, and Pandas will set it correctly for us when the data is loaded.
2. We're loading the data directly over HTTP(S) here -- Pandas `read_...` methods can accept a local file path or a URL, and Pandas will take care of fetching the data for you.

In [54]:
pd.read_json('https://raw.githubusercontent.com/simonwiles/colab_workshops/master/sample_data/imdb_top_1000.json')

Unnamed: 0,Title,Year,Rated,Runtime,Genre,Director,Actors,Plot,Language,Country,Awards,Poster,Metascore,imdbRating,imdbVotes,BoxOffice,Production,RottenTomatoes
tt0111161,The Shawshank Redemption,1994,R,142,Drama,Frank Darabont,"Tim Robbins, Morgan Freeman, Bob Gunton, Willi...",Two imprisoned men bond over a number of years...,English,USA,Nominated for 7 Oscars. Another 19 wins & 32 n...,https://m.media-amazon.com/images/M/MV5BMDFkYT...,80.0,9.3,2139341,,Columbia Pictures,90%
tt0068646,The Godfather,1972,R,175,"Crime, Drama",Francis Ford Coppola,"Marlon Brando, Al Pacino, James Caan, Richard ...",The aging patriarch of an organized crime dyna...,"English, Italian, Latin",USA,Won 3 Oscars. Another 24 wins & 28 nominations.,https://m.media-amazon.com/images/M/MV5BM2MyNj...,100.0,9.2,1469958,,Paramount Pictures,98%
tt7286456,Joker,2019,R,121,"Crime, Drama, Thriller",Todd Phillips,"Joaquin Phoenix, Robert De Niro, Zazie Beetz, ...","A gritty character study of Arthur Fleck, a ma...",English,"USA, Canada",,https://m.media-amazon.com/images/M/MV5BNGVjNW...,58.0,9.1,87202,,Warner Bros. Pictures,77%
tt0468569,The Dark Knight,2008,PG-13,152,"Action, Crime, Drama, Thriller",Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart, M...",When the menace known as The Joker emerges fro...,"English, Mandarin","USA, UK",Won 2 Oscars. Another 152 wins & 155 nominations.,https://m.media-amazon.com/images/M/MV5BMTMxNT...,84.0,9.0,2101295,"$533,316,061",Warner Bros. Pictures/Legendary,94%
tt5813916,The Mountain II,2016,,135,"Action, Drama, War",Alper Caglar,"Ozan Agaç, Bedii Akin, Murat Arkin, Eylül Arular",In a desolate war zone where screams of the in...,Turkish,Turkey,,https://m.media-amazon.com/images/M/MV5BMDdkMj...,,9.0,101514,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
tt1563738,One Day,2011,PG-13,107,"Drama, Romance",Lone Scherfig,"Anne Hathaway, Jim Sturgess, Tom Mison, Jodie ...",After spending the night together on the night...,English,"USA, UK",1 win & 2 nominations.,https://m.media-amazon.com/images/M/MV5BMTQ3NT...,48.0,7.0,129900,"$13,766,014",Focus Features,36%
tt0109831,Four Weddings and a Funeral,1994,R,117,"Comedy, Drama, Romance",Mike Newell,"Hugh Grant, James Fleet, Simon Callow, John Ha...","Over the course of five social occasions, a co...","English, British Sign Language",UK,Nominated for 2 Oscars. Another 24 wins & 23 n...,https://m.media-amazon.com/images/M/MV5BMTMyNz...,81.0,7.0,128874,,Gramercy Pictures,96%
tt1862079,Safety Not Guaranteed,2012,R,86,"Comedy, Drama, Romance",Colin Trevorrow,"Aubrey Plaza, Lauren Carlos, Basil Harris, Mar...",Three magazine employees head out on an assign...,English,USA,7 wins & 18 nominations.,https://m.media-amazon.com/images/M/MV5BOWU3ZD...,72.0,7.0,113015,"$4,007,792",Film District,90%
tt5439796,Logan Lucky,2017,PG-13,118,"Comedy, Crime, Drama",Steven Soderbergh,"Farrah Mackenzie, Channing Tatum, Jim O'Heir, ...",Two brothers attempt to pull off a heist durin...,English,USA,2 wins & 9 nominations.,https://m.media-amazon.com/images/M/MV5BMTYyOD...,78.0,7.0,109785,"$27,696,504",Fingerprint Releasing / Bleecker Street,92%
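
What "explicit" means depends on how the JSON is laid out. As a hedged sketch, assume a layout in which the outer keys are the row labels (we have not checked that the workshop file uses exactly this layout); `json_text` below is made up for illustration, and `orient="index"` tells Pandas to read the outer keys as the `Index`.

```python
import io
import pandas as pd

# Made-up JSON text in the "index" orientation: the outer keys are row labels.
json_text = """{
  "tt0111161": {"Title": "The Shawshank Redemption", "Year": 1994},
  "tt0068646": {"Title": "The Godfather", "Year": 1972}
}"""

films = pd.read_json(io.StringIO(json_text), orient="index")

print(films.index.tolist())
print(films.loc["tt0068646", "Title"])
```

Without `orient="index"`, `read_json` treats the outer keys as column names instead, so it is worth checking the shape of the result after loading an unfamiliar file.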


#### 3.2.3. Reading data via a SQL query

We can also load data from relational databases or other datastores that expose a SQL-compatible interface.  For this example we'll download a simple SQLite database file to operate on, but Pandas’ `read_sql*` methods can also accept a `connection` object that points at a remote database server if required.

In [None]:
!pip install sqlalchemy
!mkdir -p workshop_data
!wget https://raw.githubusercontent.com/simonwiles/colab_workshops/master/sample_data/imdb_top_1000.sqlite -O workshop_data/imdb_top_1000.sqlite

In [None]:
from sqlalchemy import create_engine
engine = create_engine("sqlite:///workshop_data/imdb_top_1000.sqlite", echo=False)
pd.read_sql_query("SELECT * FROM imdb_top_1000;", con=engine, index_col='imdbID')
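
Since the query is ordinary SQL, filtering and sorting can be done by the database before any data reaches Pandas. The sketch below avoids the download by building an in-memory SQLite database with Python's built-in `sqlite3` module; the table name `films` and its three rows are made up for illustration.

```python
import sqlite3
import pandas as pd

# An in-memory SQLite database stands in for the workshop file; the table
# name `films` and its contents are made up for this sketch.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE films (imdbID TEXT, Title TEXT, Year INTEGER)")
connection.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [
        ("tt0111161", "The Shawshank Redemption", 1994),
        ("tt0068646", "The Godfather", 1972),
        ("tt0468569", "The Dark Knight", 2008),
    ],
)

# Let the database do the filtering, and bind the value as a parameter
# rather than pasting it into the SQL string.
older_films = pd.read_sql_query(
    "SELECT * FROM films WHERE Year < ? ORDER BY Year",
    con=connection,
    params=(2000,),
    index_col="imdbID",
)
connection.close()
print(older_films)
```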

#### 3.2.4. Other input formats

Pandas also has methods that allow it to read data directly from other formats, including those used by Microsoft Excel, Stata, SAS, and Google BigQuery.  More details are available from the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).

### 3.3. Writing `DataFrames` back out to persistent storage

Pandas makes writing data to persistent storage formats similarly convenient.  Methods are available to write to most of the formats Pandas can read, including all those demonstrated above.

In [None]:
data_frame.to_csv('imdb_data_2.csv')
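
A quick way to see what `to_csv` produces is to write to an in-memory buffer and read the result straight back. This is a sketch with a made-up two-row frame; `films`, `buffer`, and `round_tripped` are names chosen for illustration.

```python
import io
import pandas as pd

films = pd.DataFrame(
    {"Title": ["The Godfather", "Joker"], "Year": [1972, 2019]},
    index=pd.Index(["tt0068646", "tt7286456"], name="imdbID"),
)

# Write to an in-memory buffer instead of a file on disk; the index is
# written as the first column by default.
buffer = io.StringIO()
films.to_csv(buffer)
print(buffer.getvalue())

# Reading it back with the matching `index_col` restores `imdbID` as the index.
buffer.seek(0)
round_tripped = pd.read_csv(buffer, index_col="imdbID")
print(round_tripped)
```

Passing `index=False` to `to_csv` omits the index column, which is usually what you want when the index is just the default row numbers.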

## 4. Working with `DataFrame`s

Let's begin by loading our dataset of the top 1,000 ranked films on IMDb.

<mark>In the same way that it's common to `import pandas as pd`, it's common to use `df` as an identifier for generic `DataFrame`s, especially in tutorials and demos.  For anything other than interactive sessions or throw-away scripts, I strongly recommend using good descriptive identifiers for your `DataFrame`s!</mark>

In [60]:
df = pd.read_csv('https://raw.githubusercontent.com/simonwiles/colab_workshops/master/sample_data/imdb_top_1000.example.csv', index_col='imdbID')
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1014 entries, tt0111161 to tt0472062
Data columns (total 18 columns):
Title             1014 non-null object
Year              1014 non-null int64
Rated             1013 non-null object
Runtime           1014 non-null int64
Genre             993 non-null object
Director          1014 non-null object
Actors            1014 non-null object
Plot              1014 non-null object
Language          1014 non-null object
Country           1014 non-null object
Awards            975 non-null object
Poster            1014 non-null object
Metascore         989 non-null float64
imdbRating        1014 non-null float64
imdbVotes         1013 non-null float64
BoxOffice         561 non-null object
Production        999 non-null object
RottenTomatoes    996 non-null object
dtypes: float64(3), int64(2), object(13)
memory usage: 150.5+ KB


In [7]:
df.Rated.unique()

array(['R', 'PG-13', nan, 'Not Rated', 'PG', 'G', 'Passed', 'Approved',
       'GP', 'Unrated', 'NC-17'], dtype=object)

In [8]:
df.Rated.value_counts()

R            509
PG-13        267
PG           139
G             47
Not Rated     39
Passed         5
NC-17          4
Approved       1
Unrated        1
GP             1
Name: Rated, dtype: int64

In [21]:
df.imdbVotes.describe()

count    1.013000e+03
mean     3.280381e+05
std      2.707299e+05
min      8.720200e+04
25%      1.471990e+05
50%      2.299110e+05
75%      4.090420e+05
max      2.139341e+06
Name: imdbVotes, dtype: float64

### 4.1. Indexing and slicing
Accessing columns can be done using the dot notation, `df.column_name`, or the dictionary notation, `df['column_name']`.  This returns a `Series` object.

In [None]:
df.Title

In [None]:
type(df.Title)
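
The two notations are not quite interchangeable. The dot notation only works when the column name is a valid Python identifier that does not clash with an existing `DataFrame` attribute or method, and it cannot be used to create a new column. A small sketch, using a cut-down copy of the counties data (the column `population_thousands` is made up for illustration):

```python
import pandas as pd

counties = pd.DataFrame({
    "county": ["Marin", "Napa"],
    "county seat": ["San Rafael", "Napa"],
    "population": [250666, 135377],
})

# Both notations return the same Series for a simple column name.
print(counties.population.equals(counties["population"]))

# "county seat" contains a space, so only the dictionary notation can reach it.
print(counties["county seat"].tolist())

# New columns must also be created with the dictionary notation;
# `counties.population_thousands = ...` would set a plain attribute instead.
counties["population_thousands"] = counties["population"] / 1000
print(counties.columns.tolist())
```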

`DataFrame`s can be sliced to return a subset of the available columns -- this returns a new `DataFrame` object.

In [None]:
df[['Title', 'Director']]

In [None]:
type(df[['Title', 'Director']])

`Series` objects and other subsets of `DataFrame`s preserve the indices of the `DataFrame` from which they are derived, which makes further operations such as merging or column manipulation possible.

`DataFrame`s are designed to operate at the column level, not at the row level. However, a subset of rows can be viewed easily using a slice, as with any Python list.

In [None]:
df[10:15]

We can chain these operations together, as long as we remember what we are returning in each link of the chain.

In [None]:
df[10:15].Production

In [None]:
df.Production[10:15]

In [None]:
df[['Production']][10:15]

In [None]:
df[["Title", "Plot"]].iloc[2:5]

In [None]:
df[["Title", "Plot"]].loc['tt7286456':'tt5813916']

<mark>Notice how in the `.iloc` example, the record at position 5 is omitted, in the usual Python slice fashion, but in the `.loc` example, the observation with `imdbID == tt5813916` is *included* in the result.</mark>
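
The difference between the two slicing styles is easy to check on a small `Series`. In this sketch, `ratings` holds the ratings of the first four films in the dataset, and the names `by_position` and `by_label` are chosen for illustration.

```python
import pandas as pd

ratings = pd.Series(
    [9.3, 9.2, 9.1, 9.0],
    index=["tt0111161", "tt0068646", "tt7286456", "tt0468569"],
)

# Position-based: the stop position (2) is excluded, as in a Python list.
by_position = ratings.iloc[0:2]

# Label-based: the stop label is included in the result.
by_label = ratings.loc["tt0111161":"tt7286456"]

print(len(by_position))
print(len(by_label))
```

Label-based slices include the end point because there is generally no way to name "the label after this one" without already knowing the contents of the `Index`.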

#### 4.1.1 Activity

Given the `DataFrame` defined above, write an expression to extract a `DataFrame` with the columns `Title`, `Year`, `Director`, and `imdbRating`. Show only the first 5 rows of it.

In [None]:
# Write your code here

### 4.2. Expressions
<mark>TODO: intro.</mark>

Operations performed on a column, or `Series`, are broadcast to each of the elements.

In [None]:
# convert the running times to hours
df.Runtime / 60

Simple string concatenation can be performed in the same manner:

In [None]:
df.Title + ' (' + df.Rated + '), directed by: ' + df.Director

as can boolean comparisons:

In [None]:
df.Year < 2000

By itself this is not terribly useful, but expressions of this kind are very powerful when passed to a `DataFrame` to select content:

In [None]:
df[df.Production == 'Paramount Pictures']

In [None]:
df[df.Year < 2000]

Any expression that evaluates to a `Series` of boolean values (`True` or `False`) and shares the same `Index` as the source `DataFrame` can be used.  Complex conditions can be assembled using bitwise logical operators `&`, `|`, and `~` to create simple but powerful filters.

In [None]:
# returns a `Series`
df[(df.Year < 2000) & (df.imdbRating > 8)].Title

In [None]:
# returns a `DataFrame`
df.loc[(df.Year < 2000) & (df.imdbRating > 8), ['Title', 'Year', 'imdbRating']]
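
It can help to give the individual conditions names before combining them. The sketch below uses four films from the dataset in a small made-up frame; `is_recent` and `is_top_rated` are names chosen for illustration.

```python
import pandas as pd

films = pd.DataFrame(
    {
        "Title": ["The Godfather", "The Dark Knight", "One Day", "Logan Lucky"],
        "Year": [1972, 2008, 2011, 2017],
        "imdbRating": [9.2, 9.0, 7.0, 7.0],
    }
)

# Each comparison yields a boolean Series that can be stored and reused.
is_recent = films.Year >= 2000
is_top_rated = films.imdbRating >= 9

# `&` is "and", `|` is "or", `~` is "not".  The parentheses matter when the
# comparisons are written inline, because `&` and `|` bind more tightly
# than `<` or `>=`.
recent_and_top = films[is_recent & is_top_rated]
old_or_top = films[(films.Year < 2000) | (films.imdbRating >= 9)]
not_recent = films[~is_recent]

print(recent_and_top.Title.tolist())
print(old_or_top.Title.tolist())
print(not_recent.Title.tolist())
```

Python's own `and`, `or`, and `not` keywords cannot be used here: they try to reduce a whole `Series` to a single truth value, which Pandas refuses to do.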

The `.str` property gives access to string methods that are applied to each value in a broadcast fashion, so that text columns can be manipulated:

In [None]:
df.Actors.str.split(', ')

We can make use of `.str` in expressions:

In [None]:
# select records with Oscar nominations or wins
# (`na=False` counts records with no `Awards` text as non-matches)
df[df.Awards.str.contains('Oscar', na=False)]

In [None]:
# select records that credit more than three directors
df[df.Director.apply(lambda a: len(a.split(', ')) > 3)]
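
Missing values need a little care when `.str` results are used as filters, because a missing input produces a missing output rather than `True` or `False`. A small sketch, using three `Awards` values from the dataset (the `awards` and `mentions_oscar` names are chosen for illustration):

```python
import pandas as pd

awards = pd.Series(
    ["Won 3 Oscars. Another 24 wins & 28 nominations.",
     None,
     "7 wins & 18 nominations."],
    index=["tt0068646", "tt7286456", "tt1862079"],
    name="Awards",
)

# With the default settings the missing entry stays missing in the result,
# so the result is not a clean True/False mask.
print(awards.str.contains("Oscar"))

# `na=False` treats missing entries as "no match".
mentions_oscar = awards.str.contains("Oscar", na=False)
print(awards[mentions_oscar].index.tolist())

# Other string methods broadcast in the same way.
print(awards.str.upper())
```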

#### 4.2.1. Activity

Returning to our “Bay Area Counties” data from earlier, write an expression to calculate the population density of each county.

In [None]:
data = {
    "county": ["Alameda", "Contra Costa", "Marin", "Napa", "San Francisco", "San Mateo", "Santa Clara", "Solano", "Sonoma"],
    "county seat": ["Oakland", "Martinez", "San Rafael", "Napa", "San Francisco", "Redwood City", "San Jose", "Fairfield", "Santa Rosa"],
    "population": [1494876, 1037817, 250666, 135377, 870887, 711622, 1762754, 411620, 478551],
    "area": [2130, 2080, 2140, 2040, 600.59, 1930, 3380, 2350, 4580]  # data is in km²
}
bay_area_counties = pd.DataFrame(data)
with pd.option_context('display.max_rows', 10):
    display(bay_area_counties)

In [None]:
# Write your code here

### 4.3 Cleaning and manipulating `DataFrame`s

<mark>TODO: something on basic assignment to `DataFrame`s?</mark>

A fundamental way of manipulating the contents of `DataFrame`s is the `apply()` method.  `apply()` takes a function as an argument; called on a `Series`, it `apply`s the function to each element, and called on a `DataFrame`, it `apply`s the function to each column (or, with `axis=1`, to each row).

In [28]:
def count_genres_naive(genres_text):
    genres = genres_text.split(", ")
    return len(genres)

In [24]:
df.Genre

imdbID
tt0111161                                Drama
tt0068646                         Crime, Drama
tt7286456               Crime, Drama, Thriller
tt0468569       Action, Crime, Drama, Thriller
tt5813916                   Action, Drama, War
                           ...                
tt1563738                       Drama, Romance
tt0109831               Comedy, Drama, Romance
tt1862079               Comedy, Drama, Romance
tt5439796                 Comedy, Crime, Drama
tt0472062    Biography, Comedy, Drama, History
Name: Genre, Length: 1014, dtype: object

In [31]:
df.Genre.apply(count_genres_naive)

AttributeError: 'float' object has no attribute 'split'

Unfortunately our `count_genres_naive` function does not know how to handle missing data.  (Pandas uses the `NaN` value from the underlying NumPy package to represent missing data, and `NaN` is a `float`, which is why the `AttributeError` reads as it does.  Note that if the missing data were represented as `None`, the same problem would arise, and if it were represented by the empty string `''` it would be worse, as `len(''.split(',')) == 1`: the function would silently report one genre!)

We could handle this problem in a number of ways:
1. we could modify our `count_genres_naive` function to handle the missing values;
2. we could drop the observations with missing values immediately (and temporarily) before we apply our count function; or
3. we could clean the dataset when we initially import it.
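
As a sketch of the first option, the function below checks for missing or empty input before splitting. The name `count_genres` is made up for illustration, the identifier `tt0000000` is a placeholder rather than a real record, and returning `0` for a missing value is a design choice rather than the only sensible answer (returning a missing value would be equally defensible).

```python
import pandas as pd

def count_genres(genres_text):
    """Count comma-separated genres, treating missing or empty text as zero."""
    # `pd.isna` is True for both `NaN` and `None`.
    if pd.isna(genres_text) or genres_text == "":
        return 0
    return len(genres_text.split(", "))

genres = pd.Series(
    ["Drama", "Crime, Drama", None, ""],
    index=["tt0111161", "tt0068646", "tt0057012", "tt0000000"],
)
print(genres.apply(count_genres).tolist())
```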

In [32]:
df.Genre.dropna()

imdbID
tt0111161                                Drama
tt0068646                         Crime, Drama
tt7286456               Crime, Drama, Thriller
tt0468569       Action, Crime, Drama, Thriller
tt5813916                   Action, Drama, War
                           ...                
tt1563738                       Drama, Romance
tt0109831               Comedy, Drama, Romance
tt1862079               Comedy, Drama, Romance
tt5439796                 Comedy, Crime, Drama
tt0472062    Biography, Comedy, Drama, History
Name: Genre, Length: 993, dtype: object

In [35]:
df.Genre.dropna().apply(count_genres_naive)

imdbID
tt0111161    1
tt0068646    2
tt7286456    3
tt0468569    4
tt5813916    3
            ..
tt1563738    2
tt0109831    3
tt1862079    3
tt5439796    3
tt0472062    4
Name: Genre, Length: 993, dtype: int64

In [61]:
df['genre_count'] = df.Genre.dropna().apply(count_genres_naive)
df[pd.isna(df.Genre)][['Genre', 'genre_count']]

imdbID,Genre,genre_count
tt0057012,,
tt0079470,,
tt0072431,,
tt0091042,,
tt1119646,,
...,...,...
tt0109686,,
tt0424345,,
tt0357413,,
tt0302886,,


One problem with this approach is that we 

In [33]:
df.isnull().sum()

Title               0
Year                0
Rated               1
Runtime             0
Genre              21
Director            0
Actors              0
Plot                0
Language            0
Country             0
Awards             39
Poster              0
Metascore          25
imdbRating          0
imdbVotes           1
BoxOffice         453
Production         15
RottenTomatoes     18
dtype: int64
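
Once we know where the gaps are, `dropna` and `fillna` can be aimed at specific columns instead of the whole frame. The sketch below uses three films from the dataset in a small made-up frame; the `"unknown"` placeholder is an arbitrary choice for illustration.

```python
import pandas as pd

films = pd.DataFrame(
    {
        "Title": ["The Dark Knight", "The Mountain II", "Joker"],
        "Metascore": [84.0, None, 58.0],
        "BoxOffice": ["$533,316,061", None, None],
    }
)

# Count missing values per column, as above.
print(films.isnull().sum())

# Drop only the rows that lack a Metascore, ignoring gaps in other columns.
scored = films.dropna(subset=["Metascore"])

# Or keep every row and substitute a placeholder for the missing values.
filled = films.fillna({"BoxOffice": "unknown"})

print(scored.Title.tolist())
print(filled.BoxOffice.tolist())
```

Neither call modifies `films`; both return new `DataFrame`s, so the original data remains available for comparison.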