<a href="https://colab.research.google.com/github/werowe/HypatiaAcademy/blob/master/pandas/pandas_missing_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to Clean Up Pandas Data

> *Video Tutorial*
>
> This tutorial is explained in depth in [this video](https://www.youtube.com/watch?v=6N_ncQjdcAU).

Here we show how to clean up Pandas data, in particular what to do about missing data and what to do with invalid data.

Consider a survey.  If you send people a survey you cannot control in all cases what questions they will answer.  And you cannot anticipate what data they might put that is invalid.  For example, they might leave off some data.  Or they might enter a number in a question that is only supposed to be text.

So here we show you how to:

* generate some random data that is purposefully not clean
* drop duplicate rows
* get rid of rows that have missing values
* convert missing values to something else, like a fixed value or the average of all the other values in the column
* delete outliers, which are obvious typos.  For example, here we enter some salaries as 1 million while everyone else is around 100,000.  So that's most likely a mistake (since these are employees and not company owners or the CEO).
* apply a custom function to every row to do whatever special checking you want
* check for missing values
* show different ways to check for numbers or strings and how to convert those when they are of the wrong type

# Bad Data

Below we create some data and purposely add some bad data.  What do we do with this data?  Do we fix it?  Do we erase it?  Do we drop entire rows?  If this were a survey of 100 questions, you would not want to delete the rows, because almost every person will either enter bad data or not answer all the questions.

In this data we have:

1. blank data in numeric columns
2. NaN (not a number) data in numeric columns
3. blank values in text columns
4. numbers that are multiple standard deviations away from the other answers, suggesting a typo
5. numbers in a Yes-No column

We will show how to clean up each.




In [28]:
import numpy as np
import pandas as pd

import random

def makedata():
  mean=10000
  std=25

  # this code creates random data.  It adds invalid and missing values to give us data to work with.

  cols = [("name", str), ("education", str),
     ("age", np.int8), ("city",str), ("id", np.int8), ("email", str), ("salary", np.int8),
        ("citizen", ["Y", "N"])]

  words = [np.nan, "", "abc", "def", "ghi", "jkl", "mno", "pqr"]  # np.nan: the np.NaN alias was removed in NumPy 2.0

  records = []

  for i in range(20):

    data = {}

    for c in cols:

      if c[1] == np.int8:
        if random.randint(0,5)==5:
            data[c[0]] = np.nan
        else:
            data[c[0]] = abs(int(random.gauss(mean, std)))

      if c[0] == "citizen":
        if random.randint(0,5)==5:
            data[c[0]] = random.randint(0,10)
        else:
            data[c[0]] =c[1][random.randint(0,1)]

      if c[1] == str and c[0] != "citizen":
        data[c[0]] = words[random.randint(0,len(words)-1)]

      if (c[0] == "salary") & (random.randint(0,5)==0):
            data[c[0]] = 1000000

      if (c[0] == "age"):
          data[c[0]] = random.randint(20,25)

    records.append(data)

  df=pd.DataFrame(records)

  return df

df = makedata()
df


Unnamed: 0,name,education,age,city,id,email,salary,citizen
0,jkl,ghi,25,,9997.0,,,N
1,,,20,jkl,10000.0,def,1000000.0,N
2,abc,ghi,22,pqr,,,9998.0,N
3,pqr,ghi,20,,9994.0,pqr,9974.0,Y
4,ghi,abc,21,,9974.0,pqr,,N
5,ghi,jkl,21,pqr,9994.0,jkl,1000000.0,N
6,def,def,24,pqr,10003.0,pqr,9976.0,3
7,abc,jkl,22,pqr,,,9959.0,Y
8,def,ghi,25,abc,10049.0,mno,9988.0,N
9,,,23,,10019.0,abc,10003.0,N


# picky

# difficult to please

In [29]:
# check which values in the 'name' column are missing (NaN or None)

df['name'].isnull()


Unnamed: 0,name
0,False
1,False
2,False
3,False
4,False
5,False
6,False
7,False
8,False
9,False
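Note that `isnull()` only flags real missing values (`NaN` or `None`). An empty string is still a string, so a blank text answer is reported as `False`. Here is a minimal sketch, using a small hypothetical Series called `names`, of turning blank strings into `NaN` first so that they are counted as missing too:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one real NaN, one empty string, one whitespace-only string
names = pd.Series(["abc", "", np.nan, "  ", "def"])

# isnull() flags only the real NaN
print(names.isnull().tolist())    # [False, False, True, False, False]

# Treat empty or whitespace-only strings as missing too
cleaned = names.replace(r"^\s*$", np.nan, regex=True)
print(cleaned.isnull().tolist())  # [False, True, True, True, False]
```

Whether a blank answer should count as missing depends on the survey, so this conversion is a choice and not a rule.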


In [30]:
# df.loc[logical expression, columns]

df.loc[df['name'].isnull(), :]

Unnamed: 0,name,education,age,city,id,email,salary,citizen
11,,ghi,21,abc,10021.0,,,N
15,,abc,24,abc,9965.0,pqr,9977.0,6


In [3]:
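# Here we check every cell for missing values.  isna() and isnull() are aliases, so either one alone gives the same result.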

(df.isna() | df.isnull())


Unnamed: 0,name,education,age,city,id,email,salary,citizen
0,False,False,False,False,False,False,False,False
1,False,False,False,True,False,False,False,False
2,True,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False
6,False,True,False,False,False,False,False,False
7,True,False,False,False,False,False,False,False
8,False,False,False,True,False,True,False,False
9,False,False,False,False,False,True,False,False


In [31]:
df.loc[df['name'].isnull(), :]

Unnamed: 0,name,education,age,city,id,email,salary,citizen
11,,ghi,21,abc,10021.0,,,N
15,,abc,24,abc,9965.0,pqr,9977.0,6


In [32]:
df.loc[df['name'].isna(), :]

Unnamed: 0,name,education,age,city,id,email,salary,citizen
11,,ghi,21,abc,10021.0,,,N
15,,abc,24,abc,9965.0,pqr,9977.0,6


In [4]:
# None is Python's built-in null object (the only instance of NoneType).

# np.nan is a float (the IEEE 754 "not a number" value), so the two are not equal.

np.nan == None

False
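The difference between `None` and `NaN` can be checked directly. The sketch below uses only built-in Python, NumPy, and Pandas calls. The Series `s` is a made-up example.

```python
import numpy as np
import pandas as pd

print(type(None).__name__)    # NoneType
print(type(np.nan).__name__)  # float

# NaN is not equal to anything, including itself
print(np.nan == np.nan)       # False

# pd.isna() treats both None and NaN as missing
print(pd.isna(None), pd.isna(np.nan))   # True True

# In a numeric Series, None is converted to NaN
s = pd.Series([1.0, None, np.nan])
print(s.isna().tolist())      # [False, True, True]
```

Because `NaN == NaN` is `False`, use `isna()` or `isnull()` to find missing values instead of comparing with `==`.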

In [33]:
None

In [5]:
df['salary']

Unnamed: 0,salary
0,10009.0
1,9999.0
2,10009.0
3,10025.0
4,9996.0
5,9966.0
6,9998.0
7,9975.0
8,9997.0
9,9971.0



What do we do with outliers?


1, 2, 4, 999999999

If we replace the outlier with the average of the other values, the average does not change:


1, 2, 4, 2.33


So we don't have to delete the outliers.

What is an outlier?  It is a value that is a large number of standard deviations away from the mean.
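The small example above can be worked through in code. This is only a sketch: the outlier is picked by hand (the last value), and the names `others`, `mean_of_others`, and `fixed` are made up for the illustration.

```python
import pandas as pd

values = pd.Series([1, 2, 4, 999999999])

# Treat the last value as the outlier (chosen by hand for this illustration)
others = values.iloc[:-1]
mean_of_others = others.mean()   # (1 + 2 + 4) / 3

# Replace the outlier with the mean of the remaining values
fixed = values.astype(float)
fixed.iloc[-1] = mean_of_others

print(round(mean_of_others, 2))  # 2.33
print(round(fixed.mean(), 2))    # 2.33
```

The mean of the repaired data equals the mean of the values we trusted in the first place, so the replacement keeps the row without distorting the average.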




In [6]:
# draw from mean()
# we can use the mean() function on a series.  A series is one column.
# Then we can replace blank values with the mean.  This is logical.



In [34]:
df['salary'].mean()

195620.75

In [8]:
df['citizen']

Unnamed: 0,citizen
0,N
1,Y
2,Y
3,N
4,N
5,N
6,Y
7,N
8,N
9,Y


In [9]:
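# Yes/No columns cannot contain numbers.  Here anything other than 'Y' or 'N' is replaced with an empty string.
# NaN would also be OK, as we can get rid of it in a second step, plus certain functions like mean()
# will ignore it.  coerce (as in pd.to_numeric(..., errors='coerce')) means convert to NaN on error.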

df['citizen'] = df['citizen'].apply(lambda x: x if x in ['Y', 'N'] else "")

df['citizen']

Unnamed: 0,citizen
0,N
1,Y
2,Y
3,N
4,N
5,N
6,Y
7,N
8,N
9,Y
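For numeric columns the reverse check is often needed: text typed into a field that should hold a number. One common option is `pd.to_numeric` with `errors='coerce'`, which converts anything that cannot be parsed as a number into `NaN`. A minimal sketch, with a hypothetical Series `raw` standing in for survey answers:

```python
import pandas as pd

# Hypothetical survey answers for a numeric question
raw = pd.Series(["42", "17.5", "abc", "", None])

# Values that cannot be parsed as numbers become NaN
numbers = pd.to_numeric(raw, errors="coerce")

print(numbers.isna().tolist())  # [False, False, True, True, True]
print(numbers.mean())           # 29.75 (NaN values are ignored)
```

After the conversion, the `NaN` values can be dropped or filled in a second step.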


In [10]:
df['salary']

Unnamed: 0,salary
0,10009.0
1,9999.0
2,10009.0
3,10025.0
4,9996.0
5,9966.0
6,9998.0
7,9975.0
8,9997.0
9,9971.0


In [37]:
mean

62108.05263157895

In [39]:
def makeMean(col):

   col['salary']=mean

   return col


# I think this is making a new DataFrame.  The question is how to update the existing DataFrame.
# Can we use inplace=True?

x=df.loc[abs(df['salary'] - mean) > 2 * std].apply(makeMean)


# Homework

Fix this.  It should have only updated the 3 rows and only the salary column.

In [43]:
x

Unnamed: 0,name,education,age,city,id,email,salary,citizen
1,,,20.0,jkl,10000.0,def,1000000.0,N
5,ghi,jkl,21.0,pqr,9994.0,jkl,1000000.0,N
17,jkl,def,20.0,abc,9958.0,,1000000.0,Y
salary,62108.052632,62108.052632,62108.052632,62108.052632,62108.052632,62108.052632,62108.052632,62108.052632


In [41]:
x.index

Index([1, 5, 17, 'salary'], dtype='object')
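One possible way to approach the homework, sketched on a small hypothetical DataFrame: build a boolean mask and assign through `.loc[rows, column]`, which updates the existing DataFrame instead of creating a new one. Replacing with the mean of the non-outlier rows (`clean_mean`) is an assumption made here; the cell above uses the overall mean.

```python
import pandas as pd

# Hypothetical salaries: one obvious typo among otherwise similar values
df = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f", "g", "h"],
    "salary": [9998.0, 10003.0, 9974.0, 1000000.0, 9988.0, 9959.0, 9976.0, 10009.0],
})

mean = df["salary"].mean()
std = df["salary"].std()

# Boolean mask marking the outlier rows
is_outlier = (df["salary"] - mean).abs() > 2 * std

# Mean of the salaries that are not outliers (assumption: this is the value we want)
clean_mean = df.loc[~is_outlier, "salary"].mean()

# Update only the outlier rows, and only the salary column
df.loc[is_outlier, "salary"] = clean_mean

print(int(is_outlier.sum()))   # 1
print(df.shape)                # (8, 2)
```

No extra row or column appears, because the assignment names both the rows (the mask) and the single column to change.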

In [42]:
df

Unnamed: 0,name,education,age,city,id,email,salary,citizen
0,jkl,ghi,25,,9997.0,,,N
1,,,20,jkl,10000.0,def,1000000.0,N
2,abc,ghi,22,pqr,,,9998.0,N
3,pqr,ghi,20,,9994.0,pqr,9974.0,Y
4,ghi,abc,21,,9974.0,pqr,,N
5,ghi,jkl,21,pqr,9994.0,jkl,1000000.0,N
6,def,def,24,pqr,10003.0,pqr,9976.0,3
7,abc,jkl,22,pqr,,,9959.0,Y
8,def,ghi,25,abc,10049.0,mno,9988.0,N
9,,,23,,10019.0,abc,10003.0,N


In [11]:
# Drop outliers: rows whose salary is too many standard deviations away from the mean.
# If |salary - mean| > threshold * std, then drop the row.
# Remember to use inplace=True (or assign the result back to df).


mean = df['salary'].mean()
std = df['salary'].std()

# Define threshold for outliers (e.g., values more than 2 standard deviations away from the mean)
threshold = 2

# Steps:
# 1. find the outlier rows with .loc
# 2. take their index labels
# 3. drop those rows in place


df.drop(df.loc[abs(df['salary'] - mean) > threshold * std].index, inplace=True)

df



Unnamed: 0,name,education,age,city,id,email,salary,citizen
0,ghi,abc,22,ghi,10027.0,def,10009.0,N
1,ghi,pqr,21,,9986.0,abc,9999.0,Y
2,,jkl,23,abc,10028.0,mno,10009.0,Y
3,jkl,jkl,22,pqr,10043.0,abc,10025.0,N
4,ghi,mno,20,jkl,9980.0,ghi,9996.0,N
5,mno,jkl,21,def,9992.0,mno,9966.0,N
6,,,21,pqr,9994.0,pqr,9998.0,Y
7,,jkl,24,mno,10017.0,mno,9975.0,N
8,ghi,def,23,,9983.0,,9997.0,N
9,jkl,jkl,23,mno,9981.0,,9971.0,Y


In [12]:
abs(1000000 - mean) > 2 * std

True

In [13]:
mean

62108.05263157895

In [14]:
std

227120.90214672344
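To see why the 1,000,000 salary is flagged, we can redo the arithmetic with the mean and standard deviation printed above. The distance from the mean divided by the standard deviation is often called a z-score. The name `z` is just a label used in this sketch.

```python
# Values copied from the outputs above
mean = 62108.05263157895
std = 227120.90214672344
salary = 1000000

# Distance from the mean, measured in standard deviations
z = abs(salary - mean) / std

print(round(z, 2))   # 4.13
print(z > 2)         # True
```

Note that the outliers themselves pull the mean up and inflate the standard deviation, so the threshold is less strict than it would be on clean data.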

In [15]:
sMean=df['salary'].mean()

# this will replace missing values with the mean

df['salary'] = df['salary'].fillna(sMean)

df

Unnamed: 0,name,education,age,city,id,email,salary,citizen
0,ghi,abc,22,ghi,10027.0,def,10009.0,N
1,ghi,pqr,21,,9986.0,abc,9999.0,Y
2,,jkl,23,abc,10028.0,mno,10009.0,Y
3,jkl,jkl,22,pqr,10043.0,abc,10025.0,N
4,ghi,mno,20,jkl,9980.0,ghi,9996.0,N
5,mno,jkl,21,def,9992.0,mno,9966.0,N
6,,,21,pqr,9994.0,pqr,9998.0,Y
7,,jkl,24,mno,10017.0,mno,9975.0,N
8,ghi,def,23,,9983.0,,9997.0,N
9,jkl,jkl,23,mno,9981.0,,9971.0,Y


In [16]:
# replace all numbers in the Yes/No column with NaN

df['citizen'] = df['citizen'].apply(lambda x: x if isinstance(x,str) else np.nan)
df['citizen']


Unnamed: 0,citizen
0,N
1,Y
2,Y
3,N
4,N
5,N
6,Y
7,N
8,N
9,Y


In [17]:
# Here is how to drop all rows that have a NaN value in any column.
# We leave off inplace=True so that we don't delete the data that we need for this lesson.
# Remember that if you don't use inplace=True and you don't assign the result of df.somefunction()
# to a variable, then you have effectively done nothing, as nothing has changed.

df.dropna(how='any')
df

Unnamed: 0,name,education,age,city,id,email,salary,citizen
0,ghi,abc,22,ghi,10027.0,def,10009.0,N
1,ghi,pqr,21,,9986.0,abc,9999.0,Y
2,,jkl,23,abc,10028.0,mno,10009.0,Y
3,jkl,jkl,22,pqr,10043.0,abc,10025.0,N
4,ghi,mno,20,jkl,9980.0,ghi,9996.0,N
5,mno,jkl,21,def,9992.0,mno,9966.0,N
6,,,21,pqr,9994.0,pqr,9998.0,Y
7,,jkl,24,mno,10017.0,mno,9975.0,N
8,ghi,def,23,,9983.0,,9997.0,N
9,jkl,jkl,23,mno,9981.0,,9971.0,Y
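Here is a minimal sketch of the point made in the comments, using a small hypothetical DataFrame: `dropna()` returns a new DataFrame, so the result must be assigned to keep it. The `subset` parameter limits the check to the columns you list.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["abc", np.nan, "def"],
    "salary": [10009.0, 9999.0, np.nan],
})

# dropna() returns a new DataFrame; df itself is unchanged
df.dropna(how="any")
print(len(df))                    # 3

# Assign the result to keep it
complete = df.dropna(how="any")
print(len(complete))              # 1

# subset= limits the check to the listed columns
has_salary = df.dropna(subset=["salary"])
print(has_salary.index.tolist())  # [0, 1]
```

Using `subset` is useful for a long survey, where dropping every row with any blank answer would leave almost nothing.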


In [18]:
# Here we run a function on a Pandas series, which is a single column.  Since apply() sends
# the function one value at a time, we can work with it as we would with any
# Python primitive (meaning built-in type.  Pandas and NumPy extend Python with new types.)
# In the next example we show how to work with an entire row, where all columns are available.


def toUpper(s):
    if isinstance(s,str):
      return s.upper()
    return s  # leave non-string values (such as NaN) unchanged


df['city']=df['city'].apply(toUpper)
df

Unnamed: 0,name,education,age,city,id,email,salary,citizen
0,ghi,abc,22,GHI,10027.0,def,10009.0,N
1,ghi,pqr,21,,9986.0,abc,9999.0,Y
2,,jkl,23,ABC,10028.0,mno,10009.0,Y
3,jkl,jkl,22,PQR,10043.0,abc,10025.0,N
4,ghi,mno,20,JKL,9980.0,ghi,9996.0,N
5,mno,jkl,21,DEF,9992.0,mno,9966.0,N
6,,,21,PQR,9994.0,pqr,9998.0,Y
7,,jkl,24,MNO,10017.0,mno,9975.0,N
8,ghi,def,23,,9983.0,,9997.0,N
9,jkl,jkl,23,MNO,9981.0,,9971.0,Y


In [19]:
# axis = 1 means apply the function to each row
# axis = 0 means apply the function to each column

# The important point to note here is that we send the entire row in as a parameter, so all the columns
# are available to use.  Remember to send back the entire row, whether or not you updated any of the columns;
# a row for which the function returns nothing comes back filled with NaN.

def wholeRow(row):
  if isinstance(row['city'],str):
     row['city'] = row['city'].lower()
  return row


df=df.apply(wholeRow, axis=1)
df

Unnamed: 0,name,education,age,city,id,email,salary,citizen
0,ghi,abc,22,ghi,10027.0,def,10009.0,N
1,ghi,pqr,21,,9986.0,abc,9999.0,Y
2,,jkl,23,abc,10028.0,mno,10009.0,Y
3,jkl,jkl,22,pqr,10043.0,abc,10025.0,N
4,ghi,mno,20,jkl,9980.0,ghi,9996.0,N
5,mno,jkl,21,def,9992.0,mno,9966.0,N
6,,,21,pqr,9994.0,pqr,9998.0,Y
7,,jkl,24,mno,10017.0,mno,9975.0,N
8,ghi,def,23,,9983.0,,9997.0,N
9,jkl,jkl,23,mno,9981.0,,9971.0,Y


# Eliminate Duplicates

Here we show how to get rid of duplicate rows.  


In [20]:
df = pd.DataFrame({
    "a" : [1,2,3,3,3,3,5,6,7,7,7,7]
})

df.groupby('a')['a'].count()


Unnamed: 0_level_0,a
a,Unnamed: 1_level_1
1,1
2,1
3,4
5,1
6,1
7,4


In [21]:
df.value_counts()

Unnamed: 0_level_0,count
a,Unnamed: 1_level_1
3,4
7,4
1,1
2,1
5,1
6,1


In [22]:
df.drop_duplicates(inplace=True)
df.groupby('a')['a'].count()

Unnamed: 0_level_0,a
a,Unnamed: 1_level_1
1,1
2,1
3,1
5,1
6,1
7,1


In [23]:
df['a'].value_counts()

Unnamed: 0_level_0,count
a,Unnamed: 1_level_1
1,1
2,1
3,1
5,1
6,1
7,1


In [24]:
df = makedata()

df.groupby('age')['age'].count()

Unnamed: 0_level_0,age
age,Unnamed: 1_level_1
20,2
21,1
22,4
23,3
24,2
25,8


In [25]:
# keep='first' Mark duplicates as True except for the first occurrence

df['duplicate'] = df.duplicated(subset=['age'], keep='first')

df.sort_values("age")

Unnamed: 0,name,education,age,city,id,email,salary,citizen,duplicate
0,ghi,pqr,20,ghi,10022.0,abc,10035.0,Y,False
17,def,,20,ghi,10036.0,jkl,10032.0,N,True
19,mno,def,21,def,9999.0,abc,1000000.0,Y,False
3,,mno,22,jkl,10038.0,jkl,10009.0,N,True
7,ghi,jkl,22,,10001.0,mno,1000000.0,1,True
14,mno,pqr,22,,10035.0,,9948.0,1,True
2,def,ghi,22,,,,9986.0,9,False
1,,,23,mno,10037.0,ghi,10002.0,4,False
18,def,def,23,mno,10015.0,mno,9999.0,5,True
12,jkl,mno,23,,9993.0,,10019.0,3,True


In [26]:
# drop the rows where duplicate is True.  Notice that the gaps in the index values show which rows were dropped

df = df[~df['duplicate']]

In [27]:
df.sort_values("age")

Unnamed: 0,name,education,age,city,id,email,salary,citizen,duplicate
0,ghi,pqr,20,ghi,10022.0,abc,10035.0,Y,False
19,mno,def,21,def,9999.0,abc,1000000.0,Y,False
2,def,ghi,22,,,,9986.0,9,False
1,,,23,mno,10037.0,ghi,10002.0,4,False
6,pqr,,24,,10015.0,abc,10019.0,Y,False
4,,abc,25,def,9983.0,,9990.0,N,False
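The same result can usually be reached in one step with `drop_duplicates` and its `subset` parameter, without adding a helper column. Here is a minimal sketch on a small hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["ghi", "def", "mno", "abc"],
    "age": [20, 20, 21, 21],
})

# Keep the first row for each age; later rows with the same age are dropped
unique_age = df.drop_duplicates(subset=["age"], keep="first")

print(unique_age.index.tolist())    # [0, 2]
print(unique_age["name"].tolist())  # ['ghi', 'mno']

# keep=False drops every row whose age appears more than once
print(len(df.drop_duplicates(subset=["age"], keep=False)))  # 0
```

As before, the gaps in the index show which rows were removed. The flag column approach is still handy when you want to inspect the duplicates before deleting them.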
