# Cleaning Dirty Data with Pandas & Python
http://www.developintelligence.com/blog/2017/08/data-cleaning-pandas-python/


In [1]:
import pandas as pd

In [2]:
#For this example, we work with a file of about 5,000 movies scraped from IMDB. This file contains dirty data
examplefile_path = "../files/movie_metadata.csv"
cleanexamplefile_path = "../files/out_movie_metadata.csv"

In [3]:
#Read the example CSV
data = pd.read_csv(examplefile_path)

## Look at your data

In [4]:
#To check the basic structure of the data we just read in, use the head() method to print the first five rows.
data.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


In [5]:
#Look at some basic stats for the 'imdb_score' column
data.imdb_score.describe()

count    5043.000000
mean        6.442138
std         1.125116
min         1.600000
25%         5.800000
50%         6.600000
75%         7.200000
max         9.500000
Name: imdb_score, dtype: float64

In [6]:
#Select a column: 
data["movie_title"]

0                                                 Avatar 
1               Pirates of the Caribbean: At World's End 
2                                                Spectre 
3                                  The Dark Knight Rises 
4       Star Wars: Episode VII - The Force Awakens    ...
5                                            John Carter 
6                                           Spider-Man 3 
7                                                Tangled 
8                                Avengers: Age of Ultron 
9                 Harry Potter and the Half-Blood Prince 
10                    Batman v Superman: Dawn of Justice 
11                                      Superman Returns 
12                                     Quantum of Solace 
13            Pirates of the Caribbean: Dead Man's Chest 
14                                       The Lone Ranger 
15                                          Man of Steel 
16              The Chronicles of Narnia: Prince Caspian 
17            

In [7]:
#Select the first 10 rows of a column:
data["duration"][:10]

0    178.0
1    169.0
2    148.0
3    164.0
4      NaN
5    132.0
6    156.0
7    100.0
8    141.0
9    153.0
Name: duration, dtype: float64

In [8]:
#Select multiple columns: 
data[["budget","gross"]]

Unnamed: 0,budget,gross
0,237000000.0,760505847.0
1,300000000.0,309404152.0
2,245000000.0,200074175.0
3,250000000.0,448130642.0
4,,
5,263700000.0,73058679.0
6,258000000.0,336530303.0
7,260000000.0,200807262.0
8,250000000.0,458991599.0
9,250000000.0,301956980.0


In [9]:
#Select all movies over two hours long: 
data[data["duration"] > 120]

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
5,Color,Andrew Stanton,462.0,132.0,475.0,530.0,Samantha Morton,640.0,73058679.0,Action|Adventure|Sci-Fi,...,738.0,English,USA,PG-13,263700000.0,2012.0,632.0,6.6,2.35,24000
6,Color,Sam Raimi,392.0,156.0,0.0,4000.0,James Franco,24000.0,336530303.0,Action|Adventure|Romance,...,1902.0,English,USA,PG-13,258000000.0,2007.0,11000.0,6.2,2.35,0
8,Color,Joss Whedon,635.0,141.0,0.0,19000.0,Robert Downey Jr.,26000.0,458991599.0,Action|Adventure|Sci-Fi,...,1117.0,English,USA,PG-13,250000000.0,2015.0,21000.0,7.5,2.35,118000
9,Color,David Yates,375.0,153.0,282.0,10000.0,Daniel Radcliffe,25000.0,301956980.0,Adventure|Family|Fantasy|Mystery,...,973.0,English,UK,PG,250000000.0,2009.0,11000.0,7.5,2.35,10000
10,Color,Zack Snyder,673.0,183.0,0.0,2000.0,Lauren Cohan,15000.0,330249062.0,Action|Adventure|Sci-Fi,...,3018.0,English,USA,PG-13,250000000.0,2016.0,4000.0,6.9,2.35,197000
11,Color,Bryan Singer,434.0,169.0,0.0,903.0,Marlon Brando,18000.0,200069408.0,Action|Adventure|Sci-Fi,...,2367.0,English,USA,PG-13,209000000.0,2006.0,10000.0,6.1,2.35,0
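
The filter above uses a single condition. As a minimal sketch of combining conditions, the example below uses a small made-up DataFrame (`toy` and its values are assumptions for illustration, not rows from the movie file). Each condition needs its own parentheses, and `&` / `|` are used in place of `and` / `or`.

```python
import pandas as pd

# Toy stand-in for the movie data; the values are invented for illustration.
toy = pd.DataFrame({
    "movie_title": ["Film A", "Film B", "Film C"],
    "duration": [178.0, 100.0, 148.0],
    "imdb_score": [7.9, 7.8, 6.8],
})

# Movies over two hours long AND rated at least 7.5.
long_and_good = toy[(toy["duration"] > 120) & (toy["imdb_score"] >= 7.5)]

# Movies over two hours long OR rated at least 7.5.
long_or_good = toy[(toy["duration"] > 120) | (toy["imdb_score"] >= 7.5)]

print(long_and_good)
print(long_or_good)
```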


## Deal with missing data
There are several ways to deal with missing data:
 - Add in a default value for the missing data
 - Get rid of (delete) the rows that have missing data
 - Get rid of (delete) the columns that have a high incidence of missing data
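
Before choosing among these options, it usually helps to see how much data is actually missing. The sketch below counts missing values per column on a small made-up DataFrame (`toy` and its values are assumptions for illustration); the same calls should work on `data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movie data; the values are invented for illustration.
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "duration": [120.0, np.nan, 95.0, 101.0],
    "country": ["USA", None, None, "UK"],
    "budget": [np.nan, np.nan, np.nan, 1_000_000.0],
})

# Number of missing values in each column.
missing_counts = toy.isna().sum()

# Fraction of missing values in each column; a high fraction may
# suggest dropping the column rather than filling it.
missing_share = toy.isna().mean()

print(missing_counts)
print(missing_share)
```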

In [11]:
## Add default values

#This replaces the NaN entries in the 'country' column with the empty string:
data.country = data.country.fillna("")
#data.country = data.country.fillna("None Given")  #Replaces NaN entries with "None Given" string

#This replaces the NaN entries in the 'duration' column with the mean duration of the other movies
data.duration = data.duration.fillna(data.duration.mean())

data.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,107.201074,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
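
As a possible variation, `fillna` also accepts a dictionary that maps column names to default values, so several columns can be filled in one call. The sketch below uses a made-up DataFrame and swaps the mean for the median, which is an assumption on our part rather than what the cell above does; the median is less affected by extreme values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movie data; the values are invented for illustration.
toy = pd.DataFrame({
    "country": ["USA", None, "UK"],
    "duration": [100.0, np.nan, 140.0],
})

# One default per column: empty string for country, median for duration.
defaults = {"country": "", "duration": toy["duration"].median()}

# fillna returns a new DataFrame; toy itself is left unchanged.
filled = toy.fillna(value=defaults)

print(filled)
```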


In [None]:
## Remove incomplete rows

#Each dropna() call below returns a new DataFrame; assign the result to keep it.

#Dropping all rows with any NA values is easy:
data.dropna()

#We can also drop only the rows in which every value is NA:
data.dropna(how='all')

#We can also require a minimum number of non-null values for a row to be kept
data.dropna(thresh=5)

#We don’t want to include any movie that doesn’t have information on when the movie came out:
data.dropna(subset=['title_year'])
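
To make the difference between these options concrete, here is a small sketch on a made-up DataFrame (`toy` and its values are assumptions for illustration). It applies each variant and compares how many rows survive.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movie data; the values are invented for illustration.
toy = pd.DataFrame({
    "movie_title": ["A", "B", None, "D"],
    "title_year": [2009.0, np.nan, np.nan, 2015.0],
    "gross": [1.0, 2.0, np.nan, np.nan],
})

no_missing = toy.dropna()                     # keep rows with no NaN at all
not_empty = toy.dropna(how="all")             # drop only rows that are entirely NaN
at_least_two = toy.dropna(thresh=2)           # keep rows with 2 or more non-null values
has_year = toy.dropna(subset=["title_year"])  # keep rows that have a release year

for name, frame in [("no_missing", no_missing), ("not_empty", not_empty),
                    ("at_least_two", at_least_two), ("has_year", has_year)]:
    print(name, len(frame))
```

None of these calls modifies `toy`; each returns a filtered copy.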


In [12]:
## Deal with error-prone columns
#We can apply the same kind of criteria to our columns. 
#We just need to use the parameter axis=1 in our code. That means to operate on columns, not rows.

#Drop the columns that contain only NA values:
data.dropna(axis=1, how='all')

#Drop all columns with any NA values:
data.dropna(axis=1, how='any')

Unnamed: 0,duration,genres,movie_title,num_voted_users,cast_total_facebook_likes,movie_imdb_link,country,imdb_score,movie_facebook_likes
0,178.000000,Action|Adventure|Fantasy|Sci-Fi,Avatar,886204,4834,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,USA,7.9,33000
1,169.000000,Action|Adventure|Fantasy,Pirates of the Caribbean: At World's End,471220,48350,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,USA,7.1,0
2,148.000000,Action|Adventure|Thriller,Spectre,275868,11700,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,UK,6.8,85000
3,164.000000,Action|Thriller,The Dark Knight Rises,1144337,106759,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,USA,8.5,164000
4,107.201074,Documentary,Star Wars: Episode VII - The Force Awakens ...,8,143,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,,7.1,0
5,132.000000,Action|Adventure|Sci-Fi,John Carter,212204,1873,http://www.imdb.com/title/tt0401729/?ref_=fn_t...,USA,6.6,24000
6,156.000000,Action|Adventure|Romance,Spider-Man 3,383056,46055,http://www.imdb.com/title/tt0413300/?ref_=fn_t...,USA,6.2,0
7,100.000000,Adventure|Animation|Comedy|Family|Fantasy|Musi...,Tangled,294810,2036,http://www.imdb.com/title/tt0398286/?ref_=fn_t...,USA,7.8,29000
8,141.000000,Action|Adventure|Sci-Fi,Avengers: Age of Ultron,462669,92000,http://www.imdb.com/title/tt2395427/?ref_=fn_t...,USA,7.5,118000
9,153.000000,Adventure|Family|Fantasy|Mystery,Harry Potter and the Half-Blood Prince,321795,58753,http://www.imdb.com/title/tt0417741/?ref_=fn_t...,UK,7.5,10000


## Normalize data types
Sometimes, especially when you’re reading in a CSV with a bunch of numbers, some of the numbers will read in as strings instead of numeric values, or vice versa. 
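
When a column has already been read in as strings, one option is to convert it afterwards with `pd.to_numeric`. The sketch below uses a made-up column (the values, including the `"n/a"` placeholder, are assumptions for illustration); `errors="coerce"` turns anything that cannot be parsed into NaN instead of raising an error.

```python
import pandas as pd

# Toy column where numbers arrived as strings, with one unparseable entry.
raw = pd.DataFrame({"duration": ["178", "169", "n/a", " 148 "]})

# Strip surrounding whitespace, then convert; unparseable values become NaN.
raw["duration"] = pd.to_numeric(raw["duration"].str.strip(), errors="coerce")

print(raw["duration"])
```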

In [15]:
#Normalize data types: declare column dtypes while reading the CSV.
#duration is read as float (it contains NaN, which a plain int column cannot hold),
#and title_year, the release year, is read as a string rather than a number.
#Note: re-reading the file discards the cleaning steps applied to data above.
data = pd.read_csv(examplefile_path, dtype={'duration': float, 'title_year': str})

## Change casing
People make typos, leave their caps lock on (or off), and add extra spaces where they shouldn’t.

In [16]:
#movie titles to uppercase (returns a new Series; data is not modified):
data['movie_title'].str.upper()

#get rid of leading and trailing whitespace:
data['movie_title'].str.strip()

0                                            Avatar
1          Pirates of the Caribbean: At World's End
2                                           Spectre
3                             The Dark Knight Rises
4        Star Wars: Episode VII - The Force Awakens
5                                       John Carter
6                                      Spider-Man 3
7                                           Tangled
8                           Avengers: Age of Ultron
9            Harry Potter and the Half-Blood Prince
10               Batman v Superman: Dawn of Justice
11                                 Superman Returns
12                                Quantum of Solace
13       Pirates of the Caribbean: Dead Man's Chest
14                                  The Lone Ranger
15                                     Man of Steel
16         The Chronicles of Narnia: Prince Caspian
17                                     The Avengers
18      Pirates of the Caribbean: On Stranger Tides
19          
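
The string methods above return new Series, so the cleaned values need to be assigned back to the column to take effect. A minimal sketch on made-up titles (the values and the choice of title case are assumptions for illustration):

```python
import pandas as pd

# Toy titles with stray whitespace and inconsistent casing.
toy = pd.DataFrame({"movie_title": ["Avatar ", "  SPECTRE ", "john carter"]})

# Chain the string methods and assign the result back to the column.
toy["movie_title"] = toy["movie_title"].str.strip().str.title()

print(toy["movie_title"].tolist())
```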

## Rename columns
Computer-generated column names can be hard to read and understand while working, so you may want to rename a column to something more user-friendly with rename().

In [17]:
#rename() returns a new DataFrame and leaves data unchanged:
data.rename(columns = {'title_year':'release_date', 'movie_facebook_likes':'facebook_likes'})

#To keep the change, assign the result back to a variable:
data = data.rename(columns = {'title_year':'release_date', 'movie_facebook_likes':'facebook_likes'})

## Save your results
When you’re done cleaning your data, you may want to export it back into CSV format for further processing in another program. 

In [18]:
data.to_csv(cleanexamplefile_path, encoding='utf-8')
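
One detail worth knowing: by default `to_csv` also writes the row index as a first column with no header, which shows up as `Unnamed: 0` when the file is read back. The sketch below, on a made-up DataFrame written to an in-memory string rather than a file (both are assumptions for illustration), compares the default with `index=False`.

```python
import io

import pandas as pd

# Toy stand-in for the cleaned data; the values are invented for illustration.
toy = pd.DataFrame({"movie_title": ["Avatar", "Spectre"], "imdb_score": [7.9, 6.8]})

# With no path given, to_csv returns the CSV text as a string.
csv_with_index = toy.to_csv()
csv_without_index = toy.to_csv(index=False)

# Read both back to compare the resulting columns.
reread_with = pd.read_csv(io.StringIO(csv_with_index))
reread_without = pd.read_csv(io.StringIO(csv_without_index))

print(list(reread_with.columns))
print(list(reread_without.columns))
```

Whether to keep the index depends on whether it carries meaning; for a default 0..n-1 index, `index=False` usually gives a cleaner file.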

## More resources
Of course, this is only the tip of the iceberg. With variations in user environments, languages, and user input, there are many ways that a potential dataset may be dirty or corrupted. At this point you should have learned some of the most common ways to clean your dataset with Pandas and Python.

For more on Pandas and data cleaning, see these resources:

- Pandas documentation: https://pandas.pydata.org/pandas-docs/stable/
- Messy Data Tutorial: http://nbviewer.jupyter.org/github/jvns/pandas-cookbook/blob/v0.1/cookbook/Chapter%207%20-%20Cleaning%20up%20messy%20data.ipynb
- Kaggle Datasets: https://www.kaggle.com/datasets
- Python for Data Analysis (“The Pandas Book”): http://shop.oreilly.com/product/0636920023784.do