## Data cleaning with pandas

This notebook is a work in progress. The goal is a basic tutorial on data cleaning, written mainly in Python with pandas. Later I plan to add a library for extracting colors from images, since the dataset includes movie covers.

**Possible tasks:**
- Clean and consolidate the data columns: remove the % sign from the score and the extra text from the release date.
- Get the main colors from the cover images.
- Create a new column holding the cover color information.
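
The color tasks are not implemented in this notebook yet. As a rough sketch of one possible approach, the function below counts coarse color buckets over a list of RGB tuples. `dominant_colors`, its `step` parameter, and the sample pixels are all hypothetical; loading real pixels from the poster files would need an image library such as Pillow, which is assumed and not shown.

```python
from collections import Counter


def dominant_colors(pixels, n=3, step=32):
    """Return the n most common colors after rounding each channel down to a multiple of `step`."""
    buckets = Counter(
        tuple((channel // step) * step for channel in pixel)
        for pixel in pixels
    )
    return [color for color, _ in buckets.most_common(n)]


# Hypothetical pixels: mostly dark red, some near-white, one blue.
sample_pixels = [(200, 10, 10)] * 5 + [(250, 250, 250)] * 3 + [(0, 0, 255)]
main_colors = dominant_colors(sample_pixels, n=2)
print(main_colors)
```

The resulting tuples could then be stored per title in the new color column.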

In [117]:
import pandas as pd
import numpy as np
from datetime import datetime

In [170]:
allTitles = pd.read_csv('data/titlesDataRaw_extended.csv', sep=",")

In [171]:
# Inspect a random sample of the imported data
allTitles.sample(5)

Unnamed: 0.1,Unnamed: 0,title,rating,available,img
221,221,Jazz On A Summer'S Day,100%,Available Aug 12,data/img/poster-222.jpg
151,151,Hosts,82%,Available Oct 2,data/img/poster-152.jpg
259,259,Waiting For The Barbarians,53%,Available Aug 7,data/img/poster-260.jpg
61,61,Elyse,50%,Available Dec 4,data/img/poster-62.jpg
256,256,Promare,97%,Available Aug 4,data/img/poster-257.jpg


In [172]:
allTitles.head(5)

Unnamed: 0.1,Unnamed: 0,title,rating,available,img
0,0,The Emoji Story (Picture Character),91%,Available Dec 22,data/img/poster-1.jpg
1,1,Yellow Rose,86%,Available Jan 5,data/img/poster-2.jpg
2,2,Soul,96%,Available Dec 25,data/img/poster-3.jpg
3,3,Sing Me A Song,88%,Available Jan 1,data/img/poster-4.jpg
4,4,Pieces Of A Woman,77%,Available Jan 7,data/img/poster-5.jpg


In [173]:
allTitles.tail(5)

Unnamed: 0.1,Unnamed: 0,title,rating,available,img
309,309,The Silencing,17%,Available Jul 16,data/img/poster-310.jpg
310,310,Showbiz Kids,96%,Available Jul 14,data/img/poster-311.jpg
311,311,Deep Blue Sea 3,71%,Available Jul 28,data/img/poster-312.jpg
312,312,Helmut Newton: The Bad And The Beautiful,69%,Available Jul 24,data/img/poster-313.jpg
313,313,Host,100%,Available Jul 30,data/img/poster-314.jpg
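
Beyond eyeballing `sample`, `head` and `tail`, it can help to check column types and missing values before cleaning. A minimal sketch on a small stand-in frame (the three rows are copied from the outputs above; `preview` is a hypothetical name):

```python
import pandas as pd

preview = pd.DataFrame({
    "title": ["Soul", "Hosts", "Host"],
    "rating": ["96%", "82%", "100%"],
    "available": ["Available Dec 25", "Available Oct 2", "Available Jul 30"],
})

# Text columns load as strings, which confirms that rating and date still need parsing.
print(preview.dtypes)

# Count missing values per column; string methods applied through map would fail on NaN.
missing = preview.isna().sum()
print(missing)
```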


In [174]:
# Drop the redundant 'Unnamed: 0' column, which only duplicates the row index.
allTitles = allTitles.drop(['Unnamed: 0'], axis=1)
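
A column named `Unnamed: 0` usually appears when a frame was saved together with its index and then read back. The sketch below reproduces that with an in-memory buffer (the tiny frame is made up) and shows `index_col=0` as an alternative to dropping the column afterwards.

```python
import io

import pandas as pd

df = pd.DataFrame({"title": ["Soul", "Host"], "rating": ["96%", "100%"]})

# to_csv writes the index by default, as a first column with an empty header.
csv_text = df.to_csv()

# A plain read_csv names that headerless column "Unnamed: 0".
with_extra = pd.read_csv(io.StringIO(csv_text))
print(list(with_extra.columns))

# Treating the first column as the index avoids the extra column.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(without_extra.columns))
```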

In [175]:
# Rename the remaining columns; the list order must match title, rating, available, img.
allTitles.columns = ['title', 'rating_perc', 'available_date', 'img_path']
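
Assigning a list to `columns` depends on the column order. A hedged alternative is `rename` with an explicit mapping, sketched here on a one-row stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Soul"],
    "rating": ["96%"],
    "available": ["Available Dec 25"],
    "img": ["data/img/poster-3.jpg"],
})

# Each old name is mapped to its new name; 'title' is left alone.
renamed = df.rename(columns={
    "rating": "rating_perc",
    "available": "available_date",
    "img": "img_path",
})
print(list(renamed.columns))
```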

In [176]:
# map applies a function to every value in the column, so no explicit for loop is needed.
allTitles['rating_perc'] = allTitles['rating_perc'].map(lambda x: x.rstrip('%'))
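
After stripping the `%`, the ratings are still strings. A sketch of a vectorized variant using the `.str` accessor, followed by a numeric conversion (the sample values come from rows shown earlier; `ratings` is a stand-in name):

```python
import pandas as pd

ratings = pd.Series(["91%", "86%", "100%", "17%"])

# .str.rstrip works on the whole column at once; to_numeric then converts the strings to integers.
rating_perc = pd.to_numeric(ratings.str.rstrip("%"))
print(rating_perc.dtype)
print(rating_perc.mean())
```

With a numeric column, sorting and aggregating by score behave as expected instead of comparing text.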

In [177]:
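# Remove the literal "Available " prefix. lstrip('Available ') treats its argument as a set of
# characters and would also eat the 'A' of 'Aug', so an anchored replace is used instead.
allTitles['available_date'] = allTitles['available_date'].str.replace(r'^Available\s+', '', regex=True)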
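
`str.lstrip` takes a set of characters, not a prefix, which is why August dates are a special case here. A small plain-Python sketch of the difference (`removeprefix` needs Python 3.9 or later):

```python
raw = "Available Aug 12"

# lstrip removes any leading characters found in the given set,
# so the 'A' of 'Aug' is stripped along with the prefix.
stripped = raw.lstrip("Available ")
print(stripped)

# removeprefix removes the exact prefix and nothing else.
exact = raw.removeprefix("Available ")
print(exact)
```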

In [179]:
# Parse 'Mon DD' strings. The format has no year, so strptime falls back to 1900.
allTitles['available_date'] = allTitles['available_date'].map(lambda x: datetime.strptime(x, '%b %d'))

In [180]:
allTitles['available_date'].unique()

array(['1900-12-22T00:00:00.000000000', '1900-01-05T00:00:00.000000000',
       '1900-12-25T00:00:00.000000000', '1900-01-01T00:00:00.000000000',
       '1900-01-07T00:00:00.000000000', '1900-12-11T00:00:00.000000000',
       '1900-12-15T00:00:00.000000000', '1900-12-23T00:00:00.000000000',
       '1900-12-18T00:00:00.000000000', '1900-01-12T00:00:00.000000000',
       '1900-12-30T00:00:00.000000000', '1900-12-29T00:00:00.000000000',
       '1900-01-08T00:00:00.000000000', '1900-12-21T00:00:00.000000000',
       '1900-11-27T00:00:00.000000000', '1900-11-20T00:00:00.000000000',
       '1900-11-13T00:00:00.000000000', '1900-12-01T00:00:00.000000000',
       '1900-11-17T00:00:00.000000000', '1900-11-24T00:00:00.000000000',
       '1900-11-25T00:00:00.000000000', '1900-12-04T00:00:00.000000000',
       '1900-11-03T00:00:00.000000000', '1900-10-27T00:00:00.000000000',
       '1900-11-06T00:00:00.000000000', '1900-11-10T00:00:00.000000000',
       '1900-10-23T00:00:00.000000000', '1900-10-30
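
The `1900` in this output comes from `strptime` filling in a default year when the format has none. One way to avoid the placeholder, sketched under the assumption that a year is known up front, is to append it to the string before parsing (`day_month` is a made-up sample):

```python
import pandas as pd

day_month = pd.Series(["Dec 22", "Jan 5", "Jul 16"])

# Append an assumed year so the parsed dates never carry the 1900 placeholder.
parsed = pd.to_datetime(day_month + " 2020", format="%b %d %Y")
print(parsed.dt.year.unique())
```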

In [181]:
# Replace the 1900 placeholder year with 2020.
allTitles['available_date'] = allTitles['available_date'].apply(lambda dt: dt.replace(year=2020))

In [182]:
# Dates before 1 February belong to the following year, so move them to 2021.
# All other rows already carry 2020 and are left unchanged.
condition = allTitles['available_date'] < '2020-02-01'
allTitles.loc[condition, 'available_date'] = allTitles.loc[condition, 'available_date'].apply(lambda dt: dt.replace(year=2021))
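
The same year rollover can be written without `apply`. A sketch, assuming as the cell above does that January dates belong to 2021 and everything else to 2020 (`dates` is a made-up sample):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2020-12-22", "2020-01-05", "2020-07-16"]))

# Dates before 1 February are moved one year forward; the rest are kept.
is_early = dates < "2020-02-01"
rolled = dates.where(~is_early, dates + pd.DateOffset(years=1))
print(rolled.dt.strftime("%Y-%m-%d").tolist())
```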

In [183]:
allTitles

Unnamed: 0,title,rating_perc,available_date,img_path
0,The Emoji Story (Picture Character),91,2020-12-22,data/img/poster-1.jpg
1,Yellow Rose,86,2021-01-05,data/img/poster-2.jpg
2,Soul,96,2020-12-25,data/img/poster-3.jpg
3,Sing Me A Song,88,2021-01-01,data/img/poster-4.jpg
4,Pieces Of A Woman,77,2021-01-07,data/img/poster-5.jpg
...,...,...,...,...
309,The Silencing,17,2020-07-16,data/img/poster-310.jpg
310,Showbiz Kids,96,2020-07-14,data/img/poster-311.jpg
311,Deep Blue Sea 3,71,2020-07-28,data/img/poster-312.jpg
312,Helmut Newton: The Bad And The Beautiful,69,2020-07-24,data/img/poster-313.jpg


In [184]:
allTitles.to_csv('data/titlesDataClean_extended.csv', sep=",", index=False)
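
When the cleaned file is read back, dates come in as plain strings unless asked otherwise. A sketch of a round trip through an in-memory buffer (used here instead of the real path), with `parse_dates` restoring the datetime column:

```python
import io

import pandas as pd

clean = pd.DataFrame({
    "title": ["Soul", "Yellow Rose"],
    "rating_perc": [96, 86],
    "available_date": pd.to_datetime(["2020-12-25", "2021-01-05"]),
})

# index=False keeps the row index out of the file, so no 'Unnamed: 0' column on reload.
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer, parse_dates=["available_date"])
print(reloaded.dtypes)
```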