<a href="https://colab.research.google.com/github/werowe/HypatiaAcademy/blob/master/pandas/pandas_missing_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to Clean Up Pandas Data

Here we should how to clean up Pandas data, in particular what do with about missing data, as well as what to do with invalid data.

Consider a survey.  If you send people a survey you cannot control in all cases what questions they will answer.  And you cannot anticipate what data they might put that is invalid.  For example, they might leave off some data.  Or they might enter a number in a question that is only supposed to be text.

So here we show you how to :

* generate some random data that is purposefully not clean
* drop duplicate rows
* get rid of rows that have missing values
* convert missing values to something else, like a fixed value of the average of all the other values in the colummn
* delete outliers, which are obvious typos.  For example, here we enter some salaries as 1 million while everyone else is around 100,000.  So that's most likely a mistake (since these are employees and not company owners or the CEO).
* apply a custom function to every row to do whatever special checking you want
* check for missing values
* show different ways to check for numbers or strings and how to convert those when they are of the wrong type

# Bad Data

Below we create some data and purposely add some bad data.  What do we do with this data?  Do we fix it? We we erase it?  Do we drop entire rows?  If this was a survey of 100 question you would not want to delete the rows because almost every person will either put bad data or not answer all questions.

In this data we have:

1. blank data in numeric columns
2. NaN (not a number) data in numeric columns
3. blank values in text columns
4. numbers that are multiple standard deviation away from the other answers, suggesting a type
5. Number in a Yes-No column

We will show how to clean up each.




In [108]:
import numpy as np
import pandas as pd

import random

def makedata():
  mean=10000
  std=25

  # this code creates random data.  It adds invalid and missing values to give us data to work with.

  cols = [("name", str), ("education", str),
     ("age", np.int8), ("city",str), ("id", np.int8), ("email", str), ("salary", np.int8),
        ("citizen", ["Y", "N"])]

  words = [np.NaN, "", "abc", "def", "ghi", "jkl", "mno", "pqr"]

  records = []

  for i in range(20):

    data = {}

    for c in cols:

      if c[1] == np.int8:
        if random.randint(0,5)==5:
            data[c[0]] = np.NaN
        else:
            data[c[0]] = abs(int(random.gauss(mean, std)))

      if c[0] == "citizen":
        if random.randint(0,5)==5:
            data[c[0]] = random.randint(0,10)
        else:
            data[c[0]] =c[1][random.randint(0,1)]

      if c[1] == str and c[0] != "citizen":
        data[c[0]] = words[random.randint(0,len(words)-1)]

      if (c[0] == "income") & (random.randint(0,5)==0):
          data[c[0]] = chr(random.randint(65,70))

      if (c[0] == "income") & (random.randint(0,10)==0):
          data[c[0]] = 1000000

      if (c[0] == "age"):
          data[c[0]] = random.randint(20,25)

    records.append(data)

  df=pd.DataFrame(records)

  return df




df = makedata()
df

Unnamed: 0,name,education,age,city,id,email,salary,citizen
0,mno,def,25,pqr,9971.0,abc,9980.0,N
1,ghi,ghi,20,def,9979.0,def,10014.0,Y
2,pqr,,25,def,10008.0,abc,10012.0,N
3,def,def,23,jkl,9986.0,ghi,9966.0,N
4,pqr,mno,20,jkl,10012.0,jkl,,N
5,mno,def,21,abc,9983.0,ghi,9985.0,Y
6,ghi,ghi,20,jkl,9998.0,pqr,10017.0,Y
7,ghi,abc,23,jkl,9995.0,pqr,9964.0,N
8,jkl,,25,mno,9980.0,abc,10037.0,Y
9,,,22,jkl,10005.0,,10010.0,Y


In [109]:
# check if column in series is empty

df['name'].isnull()


0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9      True
10    False
11    False
12    False
13     True
14    False
15    False
16    False
17     True
18    False
19    False
Name: name, dtype: bool

In [110]:
# draw from mean()
# we can use the mean() function on a series.  A series is one column.
# Then we can replace blank values with the mean.  This is logical.

sMean=df['salary'].mean()

df['salary'] = df['salary'].fillna(sMean)

df

Unnamed: 0,name,education,age,city,id,email,salary,citizen
0,mno,def,25,pqr,9971.0,abc,9980.0,N
1,ghi,ghi,20,def,9979.0,def,10014.0,Y
2,pqr,,25,def,10008.0,abc,10012.0,N
3,def,def,23,jkl,9986.0,ghi,9966.0,N
4,pqr,mno,20,jkl,10012.0,jkl,10002.294118,N
5,mno,def,21,abc,9983.0,ghi,9985.0,Y
6,ghi,ghi,20,jkl,9998.0,pqr,10017.0,Y
7,ghi,abc,23,jkl,9995.0,pqr,9964.0,N
8,jkl,,25,mno,9980.0,abc,10037.0,Y
9,,,22,jkl,10005.0,,10010.0,Y


In [111]:
# Yes, No columns cannot contain numbers
# NaN is OK as we can get rid of it in a second step plus certain functions like mean()
# will ignore it.  coerce means convert to NaN on error

df['citizen'] = df['citizen'].apply(lambda x: x if x in ['Y', 'N'] else "")

df['citizen']

0     N
1     Y
2     N
3     N
4     N
5     Y
6     Y
7     N
8     Y
9     Y
10    Y
11    Y
12    N
13    N
14    N
15     
16    Y
17    N
18     
19    N
Name: citizen, dtype: object

In [112]:
# drop outliers, too many standard deviations away
# if the income - mean > 2 std the replace then drop the row
# remeber to use inplace=True


mean = df['salary'].mean()
std = df['salary'].std()

# Define threshold for outliers (e.g., values more than 2 standard deviations away from the mean)
threshold = 2

df.drop(df.loc[(df['salary'] - mean) > 2 * std].index, inplace=True)
df



Unnamed: 0,name,education,age,city,id,email,salary,citizen
0,mno,def,25,pqr,9971.0,abc,9980.0,N
1,ghi,ghi,20,def,9979.0,def,10014.0,Y
2,pqr,,25,def,10008.0,abc,10012.0,N
3,def,def,23,jkl,9986.0,ghi,9966.0,N
4,pqr,mno,20,jkl,10012.0,jkl,10002.294118,N
5,mno,def,21,abc,9983.0,ghi,9985.0,Y
6,ghi,ghi,20,jkl,9998.0,pqr,10017.0,Y
7,ghi,abc,23,jkl,9995.0,pqr,9964.0,N
8,jkl,,25,mno,9980.0,abc,10037.0,Y
9,,,22,jkl,10005.0,,10010.0,Y


In [113]:
# drop all numbers in Yes/No answer

df['citizen'] = df['citizen'].apply(lambda x: x if isinstance(x,str) else np.NaN)
df['citizen']


0     N
1     Y
2     N
3     N
4     N
5     Y
6     Y
7     N
8     Y
9     Y
10    Y
11    Y
12    N
13    N
14    N
15     
16    Y
17    N
18     
19    N
Name: citizen, dtype: object

In [114]:
# Here is how to drop all rows that have any NaN values in any column.
# we will leave off inplace=True so that we don't delete all the data that we need for this lesson.
# remember that if you don't put inplace=True and you have not assigned the df.somefunction() to
# some value then you have effectively done nothing as nothing has changed

df.dropna(how='any')
df

Unnamed: 0,name,education,age,city,id,email,salary,citizen
0,mno,def,25,pqr,9971.0,abc,9980.0,N
1,ghi,ghi,20,def,9979.0,def,10014.0,Y
2,pqr,,25,def,10008.0,abc,10012.0,N
3,def,def,23,jkl,9986.0,ghi,9966.0,N
4,pqr,mno,20,jkl,10012.0,jkl,10002.294118,N
5,mno,def,21,abc,9983.0,ghi,9985.0,Y
6,ghi,ghi,20,jkl,9998.0,pqr,10017.0,Y
7,ghi,abc,23,jkl,9995.0,pqr,9964.0,N
8,jkl,,25,mno,9980.0,abc,10037.0,Y
9,,,22,jkl,10005.0,,10010.0,Y


In [115]:
# here run a function on a Pandas series, which is a single column.  But since we are
# sending a single row-column combination we can work with it as we would with any
# Python primitive (meaning built-in type.  Pandas and Numpy are extension of Python into new types.)
# In the next example we show how to work with an entire row, where we have all columns we can work with


def toUpper(s):
    if isinstance(s,str):
      return s.upper()


df['city']=df['city'].apply(toUpper)
df

Unnamed: 0,name,education,age,city,id,email,salary,citizen
0,mno,def,25,PQR,9971.0,abc,9980.0,N
1,ghi,ghi,20,DEF,9979.0,def,10014.0,Y
2,pqr,,25,DEF,10008.0,abc,10012.0,N
3,def,def,23,JKL,9986.0,ghi,9966.0,N
4,pqr,mno,20,JKL,10012.0,jkl,10002.294118,N
5,mno,def,21,ABC,9983.0,ghi,9985.0,Y
6,ghi,ghi,20,JKL,9998.0,pqr,10017.0,Y
7,ghi,abc,23,JKL,9995.0,pqr,9964.0,N
8,jkl,,25,MNO,9980.0,abc,10037.0,Y
9,,,22,JKL,10005.0,,10010.0,Y


In [116]:
# axis = 1 means row
# axis = 0 means column

# the important point to note there is we send the entire row in as a paramters. thus all the columns
# are available to use.  remember to send back the entire row after you have updated any of the columns.

def wholeRow(row):
  if isinstance(row['city'],str):
     row['city'] = row['city'].lower()
     return row


df=df.apply(wholeRow, axis=1)
df

Unnamed: 0,name,education,age,city,id,email,salary,citizen
0,mno,def,25.0,pqr,9971.0,abc,9980.0,N
1,ghi,ghi,20.0,def,9979.0,def,10014.0,Y
2,pqr,,25.0,def,10008.0,abc,10012.0,N
3,def,def,23.0,jkl,9986.0,ghi,9966.0,N
4,pqr,mno,20.0,jkl,10012.0,jkl,10002.294118,N
5,mno,def,21.0,abc,9983.0,ghi,9985.0,Y
6,ghi,ghi,20.0,jkl,9998.0,pqr,10017.0,Y
7,ghi,abc,23.0,jkl,9995.0,pqr,9964.0,N
8,jkl,,25.0,mno,9980.0,abc,10037.0,Y
9,,,22.0,jkl,10005.0,,10010.0,Y


# Eliminate Duplicates

Here we show how to get rid of duplicate rows.  


In [117]:
df = pd.DataFrame({
    "a" : [1,2,3,3,3,3,5,6,7,7,7,7]

})

df.groupby('a')['a'].count()


a
1    1
2    1
3    4
5    1
6    1
7    4
Name: a, dtype: int64

In [118]:
df.value_counts()

a
3    4
7    4
1    1
2    1
5    1
6    1
Name: count, dtype: int64

In [119]:
df.drop_duplicates(inplace=True)
df.groupby('a')['a'].count()

a
1    1
2    1
3    1
5    1
6    1
7    1
Name: a, dtype: int64

In [120]:
df['a'].value_counts()

a
1    1
2    1
3    1
5    1
6    1
7    1
Name: count, dtype: int64

In [121]:
df = makedata()

df.groupby('age')['age'].count()

age
20    2
21    5
22    2
23    3
24    4
25    4
Name: age, dtype: int64

In [122]:
# keep='first' Mark duplicates as True except for the first occurrence

df['duplicate'] = df.duplicated(subset=['age'], keep='first')


df

Unnamed: 0,name,education,age,city,id,email,salary,citizen,duplicate
0,jkl,jkl,20,,9981.0,ghi,10042.0,Y,False
1,def,ghi,21,abc,10017.0,pqr,10036.0,N,False
2,jkl,pqr,23,pqr,,pqr,10017.0,N,False
3,jkl,pqr,25,def,9980.0,abc,9991.0,Y,False
4,abc,,23,def,10011.0,,10014.0,Y,True
5,abc,jkl,21,jkl,,ghi,10017.0,6,True
6,mno,abc,25,mno,10024.0,abc,9989.0,8,True
7,jkl,pqr,21,,,,9998.0,N,True
8,jkl,ghi,24,mno,10019.0,def,9993.0,N,False
9,jkl,abc,21,mno,9993.0,,10008.0,N,True


In [123]:
# drop those where duplicate is true.  Notice the gap in index values show which rows were dropped


df[df['duplicate'] == True]

Unnamed: 0,name,education,age,city,id,email,salary,citizen,duplicate
4,abc,,23,def,10011.0,,10014.0,Y,True
5,abc,jkl,21,jkl,,ghi,10017.0,6,True
6,mno,abc,25,mno,10024.0,abc,9989.0,8,True
7,jkl,pqr,21,,,,9998.0,N,True
9,jkl,abc,21,mno,9993.0,,10008.0,N,True
11,jkl,ghi,21,jkl,10010.0,ghi,9990.0,N,True
12,pqr,abc,24,,10008.0,pqr,9991.0,Y,True
13,,,22,mno,,pqr,10076.0,Y,True
14,,ghi,25,ghi,9979.0,,9943.0,N,True
15,jkl,ghi,24,jkl,9992.0,pqr,10028.0,Y,True
