## Data cleaning with pandas

This is a work-in-progress notebook. The idea is to create a basic tutorial on data cleaning. I will mainly use Python, and eventually a library that lets me extract colors from images (since I have movie covers).

**Possible tasks:**
- Clean and consolidate data columns: remove the % sign from the score and the extra text from the release date
- Get main colors from cover images
- Create new column holding cover color information

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime

#### Load and preview data

We can easily open a CSV file by specifying its path. `pd.read_csv` also converts the file into a DataFrame that we can manipulate later. We can then preview the data conveniently with `.sample()`, `.head()` or `.tail()`.

In [19]:
#Reading files
allTitles = pd.read_csv('data/titlesDataRaw_extended.csv', sep=",")

# Print a sample of rows
allTitles.sample(5)

Unnamed: 0.1,Unnamed: 0,title,rating,available,img
48,48,A Chef'S Voyage,67%,Available Nov 24,data/img/poster-49.jpg
17,17,The Prom,57%,Available Dec 11,data/img/poster-18.jpg
66,66,Wolfman'S Got Nards,100%,Available Oct 27,data/img/poster-67.jpg
267,267,I Used To Go Here,85%,Available Aug 7,data/img/poster-268.jpg
21,21,Max Cloud,50%,Available Dec 18,data/img/poster-22.jpg


In [20]:
# Print the first 5 rows in the dataframe
allTitles.head(5)

Unnamed: 0.1,Unnamed: 0,title,rating,available,img
0,0,The Emoji Story (Picture Character),91%,Available Dec 22,data/img/poster-1.jpg
1,1,Yellow Rose,86%,Available Jan 5,data/img/poster-2.jpg
2,2,Soul,96%,Available Dec 25,data/img/poster-3.jpg
3,3,Sing Me A Song,88%,Available Jan 1,data/img/poster-4.jpg
4,4,Pieces Of A Woman,77%,Available Jan 7,data/img/poster-5.jpg


In [21]:
# Print the last 5 rows in the dataframe
allTitles.tail(5)

Unnamed: 0.1,Unnamed: 0,title,rating,available,img
309,309,The Silencing,17%,Available Jul 16,data/img/poster-310.jpg
310,310,Showbiz Kids,96%,Available Jul 14,data/img/poster-311.jpg
311,311,Deep Blue Sea 3,71%,Available Jul 28,data/img/poster-312.jpg
312,312,Helmut Newton: The Bad And The Beautiful,69%,Available Jul 24,data/img/poster-313.jpg
313,313,Host,100%,Available Jul 30,data/img/poster-314.jpg
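
The preview methods used above can also be tried on a small in-memory table, without the original CSV file. This is a minimal sketch: the name `preview` and the `random_state` value are assumptions for illustration, and the rows are copied from the outputs above.

```python
import io
import pandas as pd

# Tiny in-memory CSV built from rows shown in the outputs above
csv_text = """title,rating,available
Soul,96%,Available Dec 25
Yellow Rose,86%,Available Jan 5
Host,100%,Available Jul 30
The Prom,57%,Available Dec 11
"""

preview = pd.read_csv(io.StringIO(csv_text))

first_two = preview.head(2)   # first two rows
last_two = preview.tail(2)    # last two rows
# random_state makes the random sample reproducible between runs
random_two = preview.sample(2, random_state=0)

print(preview.shape)  # (number of rows, number of columns)
```

Fixing `random_state` is useful in a tutorial, because readers then see the same sampled rows every time they rerun the cell.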


#### Adapt data structure

We can drop, sort, replace, create or rename columns very easily.

In [22]:
# Drop the index column that I accidentally exported when scraping
# (it can be easily avoided by setting index=False when exporting the csv)
allTitles = allTitles.drop(['Unnamed: 0'], axis=1)
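
As a rough sketch of where the `Unnamed: 0` column comes from and how to avoid it, the example below exports a toy DataFrame with and without its index. The name `frame` and its two rows are assumptions for illustration only.

```python
import io
import pandas as pd

frame = pd.DataFrame({'title': ['Soul', 'Host'], 'rating': ['96%', '100%']})

# By default to_csv writes the index as an extra, unnamed first column
with_index = frame.to_csv()
# index=False leaves the index out of the exported file
without_index = frame.to_csv(index=False)

# Reading the first export back produces the 'Unnamed: 0' column
reread = pd.read_csv(io.StringIO(with_index))
# index_col=0 tells read_csv to use that column as the index instead
reread_clean = pd.read_csv(io.StringIO(with_index), index_col=0)

print(list(reread.columns))
print(list(reread_clean.columns))
```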

In [23]:
# Print column names
allTitles.columns

Index(['title', 'rating', 'available', 'img'], dtype='object')

In [24]:
# Rename columns
allTitles.columns = ['title', 'rating_perc', 'available_date', 'img_path']
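
Assigning a list to `.columns` depends on the column order. A possible alternative, sketched here on an empty toy DataFrame (`frame` is a placeholder name), is `DataFrame.rename` with an explicit old-to-new mapping:

```python
import pandas as pd

# Empty frame with the same column names as the scraped data
frame = pd.DataFrame(columns=['title', 'rating', 'available', 'img'])

# Map old names to new names; columns not listed keep their name
renamed = frame.rename(columns={
    'rating': 'rating_perc',
    'available': 'available_date',
    'img': 'img_path',
})

print(list(renamed.columns))
```

With a mapping, a column that is missing or in a different position does not silently receive the wrong name.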

In [25]:
# Reorder columns
allTitles = allTitles[['title', 'available_date', 'rating_perc', 'img_path']]

In [26]:
allTitles.columns

Index(['title', 'available_date', 'rating_perc', 'img_path'], dtype='object')

#### Remove pieces of strings

Many pandas methods operate on every row without an explicit for loop (looping is possible with `.iterrows()`, but not recommended). One of these methods is `.apply()`, which lets us pass a function, often an anonymous (lambda) function, to each element of a column. For a single column, `.apply()` is in most cases comparable to the pandas `.map()` method and to Python's built-in `map()`. Compact iterations are useful in data cleaning operations, especially when we have big DataFrames.

In [31]:
# Right strip all the rows for the column 'rating_perc' (using .apply())
allTitles['rating_perc'] = allTitles['rating_perc'].apply(lambda x: x.rstrip('%'))
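
For plain string operations, pandas also offers the `.str` accessor, which applies a string method to the whole column. The sketch below compares it with `.apply()` on a few ratings taken from the data; the variable names and the row with a missing value are assumptions for illustration.

```python
import pandas as pd

ratings = pd.Series(['91%', '86%', '100%'])

# Same result, two spellings
via_apply = ratings.apply(lambda x: x.rstrip('%'))
via_str = ratings.str.rstrip('%')

# The .str accessor passes missing values through,
# whereas calling .rstrip() on a missing value inside a lambda would fail
with_missing = pd.Series(['91%', None]).str.rstrip('%')

print(via_str.tolist())
```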

In [32]:
# Left strip all the rows for the column 'available_date' (using .map())
allTitles['available_date'] = allTitles['available_date'].map(lambda x: x.lstrip('Available '))

In [33]:
# We can check our result by printing the unique values for a particular column.
allTitles['available_date'].unique()

array(['Dec 22', 'Jan 5', 'Dec 25', 'Jan 1', 'Jan 7', 'Dec 11', 'Dec 15',
       'Dec 23', 'Dec 18', 'Jan 12', 'Dec 30', 'Dec 29', 'Jan 8',
       'Dec 21', 'Nov 27', 'Nov 20', 'Nov 13', 'Dec 1', 'Nov 17',
       'Nov 24', 'Nov 25', 'Dec 4', 'Nov 3', 'Oct 27', 'Nov 6', 'Nov 10',
       'Oct 23', 'Oct 30', 'Oct 28', 'Oct 26', 'Nov 11', 'Nov 1', 'Nov 5',
       'Oct 20', 'Oct 13', 'Oct 16', 'Oct 6', 'Oct 15', 'Oct 9', 'Oct 21',
       'Sep 29', 'Oct 2', 'Sep 22', 'Sep 18', 'Sep 30', 'Oct 1', 'Sep 25',
       'Sep 20', 'Sep 11', 'Sep 8', 'Sep 4', 'Sep 15', 'Sep 17', 'Sep 14',
       'ug 21', 'ug 28', 'ug 25', 'Sep 1', 'ug 12', 'ug 14', 'ug 18',
       'ug 8', 'ug 11', 'ug 20', 'ug 7', 'ug 4', 'ug 3', 'Jul 31',
       'Jul 28', 'Jul 14', 'Jul 24', 'Jul 17', 'Jul 10', 'Jul 21',
       'Jul 16', 'Jul 30'], dtype=object)

In [10]:
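# We can replace parts of strings if we notice that something went wrong in our cleaning.
# Here lstrip('Available ') also removed the 'A' of 'Aug', because lstrip strips a set of characters, not a prefix.
allTitles['available_date'] = allTitles['available_date'].str.replace('ug', 'Aug')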
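
The `'ug 21'` values appear because `str.lstrip` removes any leading characters contained in its argument rather than a literal prefix. The sketch below reproduces the problem on three values from the data and shows one possible way to remove only the prefix, using an anchored regular expression. The names `dates`, `lossy` and `safe` are placeholders.

```python
import pandas as pd

dates = pd.Series(['Available Aug 21', 'Available Sep 1', 'Available Jul 30'])

# lstrip removes every leading character found in the set {'A','v','a','i','l','b','e',' '},
# so the 'A' of 'Aug' is stripped as well
lossy = dates.map(lambda x: x.lstrip('Available '))

# '^' anchors the pattern to the start of the string, so only the prefix is removed
safe = dates.str.replace(r'^Available ', '', regex=True)

print(lossy.tolist())
print(safe.tolist())
```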

#### Change data format

Another tedious task is converting data types. We have a series of methods at our disposal for this, not only for standard types such as strings or integers, but also for "special" ones like datetime.

In [29]:
# We can check the data type of each column by using .dtypes
allTitles.dtypes

title             object
available_date    object
rating_perc       object
img_path          object
dtype: object
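
The output above shows that `rating_perc` is still stored as `object` (strings) after the `%` sign is removed. One way to turn it into numbers is `pd.to_numeric`, sketched below. The `'n/a'` entry is a made-up value, added only to show how `errors='coerce'` handles text that cannot be parsed.

```python
import pandas as pd

# Clean strings convert to an integer column
clean = pd.to_numeric(pd.Series(['91', '86', '100']))

# With errors='coerce', unparseable entries become NaN instead of raising an error
# ('n/a' is a hypothetical bad value, not taken from the dataset)
messy = pd.to_numeric(pd.Series(['91', '86', 'n/a']), errors='coerce')

print(clean.dtype)
print(messy.isna().sum())
```

Note that a column containing NaN is stored as floats, so the presence of bad values changes the resulting dtype.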

In [11]:
# Use the datetime module to convert dates from strings to datetime objects
allTitles['available_date'] = allTitles['available_date'].map(lambda x: datetime.strptime(x, '%b %d'))

In [12]:
# Sometimes we encounter hiccups: here the year defaults to 1900, since we didn't have it
# when scraping.
allTitles['available_date'].unique()

array(['1900-12-22T00:00:00.000000000', '1900-01-05T00:00:00.000000000',
       '1900-12-25T00:00:00.000000000', '1900-01-01T00:00:00.000000000',
       '1900-01-07T00:00:00.000000000', '1900-12-11T00:00:00.000000000',
       '1900-12-15T00:00:00.000000000', '1900-12-23T00:00:00.000000000',
       '1900-12-18T00:00:00.000000000', '1900-01-12T00:00:00.000000000',
       '1900-12-30T00:00:00.000000000', '1900-12-29T00:00:00.000000000',
       '1900-01-08T00:00:00.000000000', '1900-12-21T00:00:00.000000000',
       '1900-11-27T00:00:00.000000000', '1900-11-20T00:00:00.000000000',
       '1900-11-13T00:00:00.000000000', '1900-12-01T00:00:00.000000000',
       '1900-11-17T00:00:00.000000000', '1900-11-24T00:00:00.000000000',
       '1900-11-25T00:00:00.000000000', '1900-12-04T00:00:00.000000000',
       '1900-11-03T00:00:00.000000000', '1900-10-27T00:00:00.000000000',
       '1900-11-06T00:00:00.000000000', '1900-11-10T00:00:00.000000000',
       '1900-10-23T00:00:00.000000000', '1900-10-30

In [13]:
# We cheat a bit ;) and use the datetime .replace() method to assign the year 2020 to every row
allTitles['available_date'] = allTitles['available_date'].apply(lambda dt: dt.replace(year=2020))

In [14]:
# However, since we scraped from July 2020 until today, we change the year of the January entries to 2021

# Define our condition
condition = (allTitles['available_date'] < '2020-02-01')

# We use numpy to selectively change rows that match our initial condition, other rows will remain unchanged
allTitles['available_date'] = np.where(
    condition,
    allTitles['available_date'].apply(lambda dt: dt.replace(year=2021)),
    allTitles['available_date'].apply(lambda dt: dt.replace(year=2020)),
)
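
An alternative to parsing first and fixing the year afterwards is to attach the year to the string before parsing. The sketch below is not the approach used in this notebook. It assumes, as above, that January entries belong to 2021 and everything else to 2020, and the names `dates`, `years` and `parsed` are placeholders.

```python
import pandas as pd

dates = pd.Series(['Dec 22', 'Jan 5', 'Aug 7', 'Jul 30'])

# Pick the year per row: January entries get 2021, all others 2020
years = dates.str.startswith('Jan').map({True: '2021', False: '2020'})

# Parse month, day and year in a single vectorized call
parsed = pd.to_datetime(dates + ' ' + years, format='%b %d %Y')

print(parsed.dt.strftime('%Y-%m-%d').tolist())
```

Because the year is part of the parsed string, the intermediate 1900 dates never appear.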

In [16]:
allTitles.head(5)

Unnamed: 0,title,rating_perc,available_date,img_path
0,The Emoji Story (Picture Character),91,2020-12-22,data/img/poster-1.jpg
1,Yellow Rose,86,2021-01-05,data/img/poster-2.jpg
2,Soul,96,2020-12-25,data/img/poster-3.jpg
3,Sing Me A Song,88,2021-01-01,data/img/poster-4.jpg
4,Pieces Of A Woman,77,2021-01-07,data/img/poster-5.jpg


#### How to visualize

Once we have cleaned the data, whether as final or intermediate results, we can use Python data visualization libraries such as [Matplotlib](https://matplotlib.org/) or [seaborn](https://seaborn.pydata.org/), or we can export the data for use in other software. For the purpose of this exercise we will show how to use RAWGraphs, an open source platform for creating simple graphs.

In [184]:

# Export data
allTitles.to_csv('data/titlesDataClean_extended.csv', sep=",", index=False)
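
One thing to keep in mind when exporting: a CSV file stores dates as plain text, so the datetime type is lost unless it is requested again when reading the file back. The sketch below round-trips a two-row toy table through an in-memory buffer. The table and the names `clean`, `as_text` and `as_dates` are assumptions for illustration.

```python
import io
import pandas as pd

clean = pd.DataFrame({
    'title': ['Soul', 'Yellow Rose'],
    'available_date': pd.to_datetime(['2020-12-25', '2021-01-05']),
    'rating_perc': [96, 86],
})

# Export to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
clean.to_csv(buffer, index=False)

# Plain read: the dates come back as strings
buffer.seek(0)
as_text = pd.read_csv(buffer)

# parse_dates converts the listed columns back to datetime
buffer.seek(0)
as_dates = pd.read_csv(buffer, parse_dates=['available_date'])

print(as_text.dtypes)
print(as_dates.dtypes)
```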

#### Movie rating distribution between July 2020 and January 2021

![movies-distribution](visualizations/movies-distribution.png)