# Cleaning Dirty Data with Pandas & Python
http://www.developintelligence.com/blog/2017/08/data-cleaning-pandas-python/


In [None]:
import pandas as pd

In [None]:
#For this example, we work with a file of about 5,000 movies scraped from IMDB. This file contains dirty data.
examplefile_path = "../files/movie_metadata.csv"
cleanexamplefile_path = "../files/out_movie_metadata.csv"

In [None]:
#Read the example CSV
data = pd.read_csv(examplefile_path)

## Look at your data

In [None]:
#To check out the basic structure of the data we just read in, you can use the head() command to print out the first five rows.
data.head()

In [None]:
#Look at some basic stats for the 'imdb_score' column
data.imdb_score.describe()

In [None]:
#Select a column: 
data["movie_title"]

In [None]:
#Select the first 10 rows of a column:
data["duration"][:10]

In [None]:
#Select multiple columns: 
data[["budget","gross"]]

In [None]:
#Select all movies over two hours long: 
data[data["duration"] > 120]
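The single-condition filter above extends to several conditions. The following is a minimal sketch on a small, made-up DataFrame (the titles and values are invented for illustration and are not taken from the IMDB file); it assumes you want movies that are both long and well rated.

```python
import pandas as pd

# Hypothetical stand-in for the movie data (values invented for illustration)
movies = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "duration": [95, 130, 150, 121],
    "imdb_score": [6.1, 8.2, 5.4, 7.9],
})

# Wrap each condition in parentheses: & binds more tightly than > or >=
long_and_good = movies[(movies["duration"] > 120) & (movies["imdb_score"] >= 7.5)]

# .loc filters rows and selects a column in one step
long_titles = movies.loc[movies["duration"] > 120, "movie_title"].tolist()

print(long_and_good)
print(long_titles)
```

Use `|` for "or" and `~` for "not" in the same way.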

## Deal with missing data
There are several ways to deal with missing data:
 - Add in a default value for the missing data
 - Get rid of (delete) the rows that have missing data
 - Get rid of (delete) the columns that have a high incidence of missing data

In [None]:
## Add default values

#This replaces the NaN entries in the 'country' column with the empty string.
data.country = data.country.fillna("")
#data.country = data.country.fillna("None Given")  #Replaces NaN entries with "None Given" string

#This replaces the NaN entries in the 'duration' column with the mean duration of the remaining movies
data.duration = data.duration.fillna(data.duration.mean())
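It can help to count the missing values before and after filling them in, to confirm the defaults landed where expected. This is a small sketch on an invented three-row DataFrame (the name `sample` and its values are assumptions for illustration, not the real file).

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing country and one missing duration
sample = pd.DataFrame({
    "country": ["USA", np.nan, "UK"],
    "duration": [100.0, np.nan, 140.0],
})

# Count NaN entries per column before filling
missing_before = sample.isna().sum()

# Fill with a default string and with the column mean (mean() skips NaN)
sample["country"] = sample["country"].fillna("")
sample["duration"] = sample["duration"].fillna(sample["duration"].mean())

# Count again after filling
missing_after = sample.isna().sum()

print(missing_before)
print(missing_after)
```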

In [None]:
## Remove incomplete rows
#Note: dropna() returns a new DataFrame and leaves data unchanged; assign the result to keep it.

#Drop all rows that have any NA values:
data.dropna()

#Drop only the rows in which every value is NA:
data.dropna(how='all')

#Keep only the rows that have at least 5 non-null values:
data.dropna(thresh=5)

#We don't want to include any movie that doesn't have information on when it came out:
data.dropna(subset=['title_year'])
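To see how the `dropna()` variants differ, here is a sketch on a tiny invented DataFrame (the name `sample` and its values are illustrative assumptions). Each call returns a new DataFrame, so the results are assigned to their own variables.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: row 3 is entirely missing, rows 1 and 2 are partly missing
sample = pd.DataFrame({
    "movie_title": ["A", "B", "C", np.nan],
    "title_year": [2001.0, np.nan, 2010.0, np.nan],
    "gross": [1.5, 2.0, np.nan, np.nan],
})

complete = sample.dropna()                       # rows with no missing values
not_empty = sample.dropna(how="all")             # drop rows where every value is missing
at_least_two = sample.dropna(thresh=2)           # keep rows with 2 or more non-null values
has_year = sample.dropna(subset=["title_year"])  # require a release year

print(len(sample), len(complete), len(not_empty), len(at_least_two), len(has_year))
```

The original `sample` still has all four rows afterwards, because none of the calls modify it in place.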


In [None]:
## Deal with error-prone columns
#We can apply the same kind of criteria to our columns. 
#We just need to use the parameter axis=1 in our code. That means to operate on columns, not rows.

#Drop the columns that are all NA values:
data.dropna(axis=1, how='all')

#Drop all columns with any NA values:
data.dropna(axis=1, how='any')
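Dropping columns with a "high incidence" of missing data needs a cutoff. One possible approach, sketched here on invented data, is to compute the share of missing values per column and keep the columns below a threshold. The 50% cutoff (`max_missing`) is an arbitrary assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: aspect_ratio is mostly missing
sample = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "gross": [1.0, np.nan, 3.0, 4.0],
    "aspect_ratio": [np.nan, np.nan, np.nan, 1.85],
})

# Fraction of missing values in each column
missing_share = sample.isna().mean()

# Assumed cutoff: keep columns with at most 50% missing values
max_missing = 0.5
columns_to_keep = missing_share[missing_share <= max_missing].index
trimmed = sample[columns_to_keep]

print(missing_share)
print(list(trimmed.columns))
```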

## Normalize data types
Sometimes, especially when you’re reading in a CSV with a bunch of numbers, some of the numbers will read in as strings instead of numeric values, or vice versa. 

In [None]:
#Normalize data types: while reading the CSV, set the duration column's datatype to float
#(float rather than int, because the column contains NaN values)
data = pd.read_csv(examplefile_path, dtype={'duration': float})

#We want the release year to be a string and not a number
#Note: this re-reads the file, so it replaces the DataFrame loaded above
data = pd.read_csv(examplefile_path, dtype={'title_year': str})
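When a numeric column contains stray text, forcing a dtype at read time fails, so one option is to convert after loading with `pd.to_numeric`. The sketch below reads a small in-memory CSV (the contents are invented for illustration) and turns unparseable entries into NaN.

```python
import io

import pandas as pd

# Hypothetical CSV text: one duration is empty and one is not a number
csv_text = "movie_title,duration,title_year\nA,100,2001\nB,,2005\nC,abc,2010\n"

# Read the release year as a string, as in the cell above
raw = pd.read_csv(io.StringIO(csv_text), dtype={"title_year": str})

# errors="coerce" converts values that cannot be parsed into NaN
raw["duration"] = pd.to_numeric(raw["duration"], errors="coerce")

print(raw.dtypes)
print(raw)
```

After the conversion, the usual missing-data tools (`fillna`, `dropna`) apply to the coerced entries.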

## Change casing
People make typos, leave their caps lock on (or off), and add extra spaces where they shouldn’t.

In [None]:
#Movie titles to uppercase (this returns a new Series; assign it back to keep the change):
data['movie_title'].str.upper()

#Get rid of leading and trailing whitespace:
data['movie_title'].str.strip()
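The string methods can be chained and assigned back to the column so the cleanup persists. A minimal sketch, using an invented three-title DataFrame named `sample`:

```python
import pandas as pd

# Hypothetical titles with inconsistent casing and stray spaces
sample = pd.DataFrame({"movie_title": ["  avatar ", "Spectre  ", "JOHN carter"]})

# Strip surrounding whitespace, then uppercase, and store the result
sample["movie_title"] = sample["movie_title"].str.strip().str.upper()

print(sample["movie_title"].tolist())
```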

## Rename columns
Computer-generated column names can be hard to read and understand while working, so you may want to rename a column to something more user-friendly.

In [None]:
data.rename(columns = {'title_year':'release_date', 'movie_facebook_likes':'facebook_likes'})

#rename() returns a new DataFrame, so you'll need to assign the result to a variable to keep the change.
data = data.rename(columns = {'title_year':'release_date', 'movie_facebook_likes':'facebook_likes'})

## Save your results
When you’re done cleaning your data, you may want to export it back into CSV format for further processing in another program. 

In [None]:
data.to_csv(cleanexamplefile_path, encoding='utf-8')
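By default, `to_csv` also writes the row index as an extra, unnamed first column. Whether you want that depends on your downstream program. The sketch below uses an in-memory buffer and an invented two-row DataFrame to compare the default with `index=False`.

```python
import io

import pandas as pd

# Hypothetical cleaned data
sample = pd.DataFrame({"movie_title": ["A", "B"], "duration": [100.0, 120.0]})

# Default: the index is written too, and comes back as "Unnamed: 0"
with_index = pd.read_csv(io.StringIO(sample.to_csv()))

# index=False: only the data columns are written
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)

print(list(with_index.columns))
print(list(without_index.columns))
```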

## More resources
Of course, this is only the tip of the iceberg. With variations in user environments, languages, and user input, there are many ways that a potential dataset may be dirty or corrupted. At this point you should have learned some of the most common ways to clean your dataset with Pandas and Python.

For more on Pandas and data cleaning, see these resources:

- Pandas documentation: https://pandas.pydata.org/pandas-docs/stable/
- Messy Data Tutorial: http://nbviewer.jupyter.org/github/jvns/pandas-cookbook/blob/v0.1/cookbook/Chapter%207%20-%20Cleaning%20up%20messy%20data.ipynb
- Kaggle Datasets: https://www.kaggle.com/datasets
- Python for Data Analysis (“The Pandas Book”): http://shop.oreilly.com/product/0636920023784.do