<a href="https://colab.research.google.com/github/werowe/HypatiaAcademy/blob/master/pandas/pandas_missing_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to Clean Up Pandas Data

Here we show how to clean up Pandas data: in particular, what to do about missing data and what to do with invalid data.

Consider a survey.  If you send people a survey, you cannot always control which questions they answer, and you cannot anticipate what invalid data they might enter.  For example, they might leave a field blank.  Or they might enter a number in a question that is only supposed to take text.

So here we show you how to:

* generate some random data that is purposefully not clean
* drop duplicate rows
* get rid of rows that have missing values
* convert missing values to something else, like a fixed value or the average of all the other values in the column
* delete outliers, which are obvious typos.  For example, here we enter some incomes as 1,000,000 while everyone else is around 10,000.  So that's most likely a mistake (since these are employees and not company owners or the CEO).
* apply a custom function to every row to do whatever special checking you want
* check for missing values
* show different ways to check for numbers or strings and how to convert those when they are of the wrong type

In [14]:
import numpy as np
import pandas as pd

import random

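# typical value and spread for the random numeric columns
mean = 10000
std = 25

# (column name, type); np.int8 is only a marker meaning "numeric column"
cols = [("name", str), ("income", np.int8), ("education", str),
        ("age", str), ("city", str), ("id", np.int8), ("email", str),
        ("salary", np.int8), ("citizen", ["Y", "N"])]

# possible text answers, including a missing value and an empty string
# (np.nan replaces np.NaN, which was removed in NumPy 2.0)
words = [np.nan, "", "abc", "def", "ghi", "jkl", "mno", "pqr"]

records = []

for i in range(20):

  data = {}

  for name, kind in cols:

    # numeric columns: about 1 in 6 is missing, the rest are close to the mean
    if kind == np.int8:
      if random.randint(0, 5) == 5:
        data[name] = np.nan
      else:
        data[name] = abs(int(random.gauss(mean, std)))

    # citizen: about 1 in 6 is an invalid number instead of "Y" or "N"
    if name == "citizen":
      if random.randint(0, 5) == 5:
        data[name] = random.randint(0, 10)
      else:
        data[name] = random.choice(kind)

    # text columns: pick a random word, which may be NaN or an empty string
    if kind == str and name != "citizen":
      data[name] = random.choice(words)

    # income: sometimes a letter (A-F) where a number belongs
    if name == "income" and random.randint(0, 5) == 0:
      data[name] = chr(random.randint(65, 70))

    # income: sometimes an obvious outlier
    if name == "income" and random.randint(0, 10) == 0:
      data[name] = 1000000

  records.append(data)

df = pd.DataFrame(records)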


df

Unnamed: 0,name,income,education,age,city,id,email,salary,citizen
0,abc,,def,def,def,9987.0,pqr,,N
1,mno,9985,abc,pqr,mno,9996.0,jkl,9973.0,N
2,pqr,10000,def,abc,def,10022.0,mno,,Y
3,def,9998,abc,,ghi,9987.0,jkl,9997.0,Y
4,abc,F,abc,ghi,,10027.0,mno,10013.0,Y
5,ghi,,pqr,def,abc,10007.0,def,,N
6,def,C,mno,def,jkl,9979.0,,9999.0,3
7,,C,,abc,abc,10009.0,,10050.0,N
8,def,10037,def,def,ghi,10038.0,pqr,9982.0,N
9,pqr,10021,abc,jkl,def,9951.0,jkl,10017.0,Y


In [15]:
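# check which values in the name column are missing (True means missing).
# note that an empty string "" is not counted as missing, only NaN is.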

df['name'].isnull()


0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10     True
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
Name: name, dtype: bool
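
To check every column at once, the boolean result can be summed. The sketch below uses a small hypothetical frame (`sample` is not part of the notebook's data) and also shows one possible way to treat empty strings as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the survey data: one NaN and one empty string.
sample = pd.DataFrame({
    "name": ["abc", np.nan, "", "def"],
    "salary": [9973.0, np.nan, 10013.0, np.nan],
})

# isna() is the same test as isnull(); sum() counts the True values per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# An empty string is not NaN.  mask() replaces values with NaN wherever
# the condition is True, so blanks can be counted as missing too.
name_clean = sample["name"].mask(sample["name"] == "")
print(name_clean.isna().sum())
```

Whether a blank answer should count as missing is a judgment call that depends on the survey.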

In [16]:
# replace missing salaries with the mean of the salaries that are present

sMean=df['salary'].mean()

df['salary'] = df['salary'].fillna(sMean)

df

Unnamed: 0,name,income,education,age,city,id,email,salary,citizen
0,abc,,def,def,def,9987.0,pqr,9998.2,N
1,mno,9985,abc,pqr,mno,9996.0,jkl,9973.0,N
2,pqr,10000,def,abc,def,10022.0,mno,9998.2,Y
3,def,9998,abc,,ghi,9987.0,jkl,9997.0,Y
4,abc,F,abc,ghi,,10027.0,mno,10013.0,Y
5,ghi,,pqr,def,abc,10007.0,def,9998.2,N
6,def,C,mno,def,jkl,9979.0,,9999.0,3
7,,C,,abc,abc,10009.0,,10050.0,N
8,def,10037,def,def,ghi,10038.0,pqr,9982.0,N
9,pqr,10021,abc,jkl,def,9951.0,jkl,10017.0,Y
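
The mean is only one choice of replacement. As a rough sketch on hypothetical data (the frame `sample` and the placeholder `"unknown"` are assumptions for illustration), `fillna` also accepts a dictionary so that each column gets its own fill value:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in a numeric and a text column.
sample = pd.DataFrame({
    "salary": [9973.0, np.nan, 10013.0, 9999.0],
    "city": ["mno", np.nan, "ghi", np.nan],
})

# Fill each column with its own replacement: the median for salary
# (less sensitive to outliers than the mean) and a fixed string for city.
fill_values = {"salary": sample["salary"].median(), "city": "unknown"}
filled = sample.fillna(fill_values)
print(filled)
```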


In [17]:
# income must be numeric, so entries such as "F" or "C" are invalid.
# NaN is OK as we can get rid of it in a second step, plus certain functions like mean()
# will ignore it.  errors='coerce' means convert to NaN when a value cannot be parsed.

df['income'] = pd.to_numeric(df['income'], errors='coerce')

df['income']

0           NaN
1        9985.0
2       10000.0
3        9998.0
4           NaN
5           NaN
6           NaN
7           NaN
8       10037.0
9       10021.0
10          NaN
11      10005.0
12       9975.0
13      10026.0
14          NaN
15    1000000.0
16    1000000.0
17      10020.0
18      10026.0
19    1000000.0
Name: income, dtype: float64
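
Before overwriting a column it can help to see which entries failed to convert. A minimal sketch, assuming a hypothetical `income` Series with the same kinds of bad values as above:

```python
import numpy as np
import pandas as pd

# Hypothetical mixed column: numbers, letters, a numeric string and a NaN.
income = pd.Series([9985, "F", np.nan, "10037", "C", 1000000], dtype=object)

# Values that cannot be parsed become NaN; numeric strings are converted.
as_number = pd.to_numeric(income, errors="coerce")

# Entries that were present in the original but could not be parsed.
invalid = income[as_number.isna() & income.notna()]
print(invalid)
```

Here `invalid` holds only the letters, because the original NaN is excluded by `income.notna()`.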

In [18]:
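# drop outliers: values too many standard deviations away from the mean


mean = df['income'].mean()
std_dev = df['income'].std()

# Define threshold for outliers (e.g., values more than 2 standard deviations away from the mean)
threshold = 2

# Show the rows that would be dropped.  With only 20 rows, a few incomes of 1,000,000
# inflate the mean and the standard deviation, so this test can come back empty.
df.loc[(df['income'] - mean).abs() > threshold * std_dev]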



Unnamed: 0,name,income,education,age,city,id,email,salary,citizen
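
The empty result above is what the standard deviation rule gives when the outliers themselves inflate the standard deviation. One common alternative, shown here only as a sketch and not as the notebook's method, is to use interquartile-range (IQR) fences, which are much less affected by extreme values. The numbers are the non-missing income values displayed earlier; the 1.5 multiplier is the conventional choice and is an assumption here.

```python
import pandas as pd

# Non-missing income values as displayed in the output above.
income = pd.Series([9985, 10000, 9998, 10037, 10021, 10005, 9975,
                    10026, 1000000, 1000000, 10020, 10026, 1000000],
                   dtype=float)

# The two-standard-deviation rule: three huge values inflate the std.
two_sigma = (income - income.mean()).abs() > 2 * income.std()
print(two_sigma.sum())

# IQR fences: anything beyond 1.5 * IQR outside the quartiles is an outlier.
q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
is_outlier = (income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)

kept = income[~is_outlier]
print(kept)
```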


In [19]:
# drop the rows flagged by the same test (nothing is dropped if the test found no rows)
df.drop(df[(df['income'] - mean).abs() > threshold * std_dev].index, inplace=True)

df

Unnamed: 0,name,income,education,age,city,id,email,salary,citizen
0,abc,,def,def,def,9987.0,pqr,9998.2,N
1,mno,9985.0,abc,pqr,mno,9996.0,jkl,9973.0,N
2,pqr,10000.0,def,abc,def,10022.0,mno,9998.2,Y
3,def,9998.0,abc,,ghi,9987.0,jkl,9997.0,Y
4,abc,,abc,ghi,,10027.0,mno,10013.0,Y
5,ghi,,pqr,def,abc,10007.0,def,9998.2,N
6,def,,mno,def,jkl,9979.0,,9999.0,3
7,,,,abc,abc,10009.0,,10050.0,N
8,def,10037.0,def,def,ghi,10038.0,pqr,9982.0,N
9,pqr,10021.0,abc,jkl,def,9951.0,jkl,10017.0,Y


In [20]:
# a Yes/No answer cannot be a number: keep strings and replace everything else with NaN

df['citizen'] = df['citizen'].apply(lambda x: x if isinstance(x,str) else np.nan)
df['citizen']


0       N
1       N
2       Y
3       Y
4       Y
5       N
6     NaN
7       N
8       N
9       Y
10      Y
11      Y
12    NaN
13      Y
14      Y
15      N
16      Y
17      N
18      Y
19      Y
Name: citizen, dtype: object
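
Checking the type only rejects numbers; a string such as "maybe" would still pass. A stricter sketch, assuming the only valid answers are exactly "Y" and "N" (the `citizen` Series below is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical answers: valid codes, a number, a lowercase "y" and a NaN.
citizen = pd.Series(["N", "Y", 3, "y", np.nan, "Y"], dtype=object)

# where() keeps values for which the condition is True and puts NaN elsewhere.
valid = citizen.where(citizen.isin(["Y", "N"]))
print(valid)
```

If lowercase answers should be accepted, they would need to be normalized before the `isin` check.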

In [21]:
df

Unnamed: 0,name,income,education,age,city,id,email,salary,citizen
0,abc,,def,def,def,9987.0,pqr,9998.2,N
1,mno,9985.0,abc,pqr,mno,9996.0,jkl,9973.0,N
2,pqr,10000.0,def,abc,def,10022.0,mno,9998.2,Y
3,def,9998.0,abc,,ghi,9987.0,jkl,9997.0,Y
4,abc,,abc,ghi,,10027.0,mno,10013.0,Y
5,ghi,,pqr,def,abc,10007.0,def,9998.2,N
6,def,,mno,def,jkl,9979.0,,9999.0,
7,,,,abc,abc,10009.0,,10050.0,N
8,def,10037.0,def,def,ghi,10038.0,pqr,9982.0,N
9,pqr,10021.0,abc,jkl,def,9951.0,jkl,10017.0,Y


In [22]:
# now we are left with only those rows where no column is NaN.  Note that we still
# have some blank cells (empty strings).  You would choose among the various procedures
# we have explained above to determine how you want to handle this situation.

df.dropna(how='any', inplace=True)
df

Unnamed: 0,name,income,education,age,city,id,email,salary,citizen
1,mno,9985.0,abc,pqr,mno,9996.0,jkl,9973.0,N
2,pqr,10000.0,def,abc,def,10022.0,mno,9998.2,Y
8,def,10037.0,def,def,ghi,10038.0,pqr,9982.0,N
9,pqr,10021.0,abc,jkl,def,9951.0,jkl,10017.0,Y
18,ghi,10026.0,,pqr,jkl,10008.0,pqr,10006.0,Y
19,def,1000000.0,ghi,abc,def,9981.0,def,9996.0,Y
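
Dropping every row with any NaN can discard a lot of data (only 6 of the 20 rows survive above). As a sketch on a hypothetical frame, `dropna` can be limited to certain columns with `subset`, or relaxed with `thresh`:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in different columns.
sample = pd.DataFrame({
    "name": ["abc", "def", np.nan, "ghi"],
    "income": [9985.0, np.nan, 10021.0, np.nan],
    "city": ["mno", "def", "ghi", np.nan],
})

# Drop a row only when income is missing; gaps elsewhere are tolerated.
need_income = sample.dropna(subset=["income"])

# Keep rows that have at least two non-missing values.
at_least_two = sample.dropna(thresh=2)

print(need_income)
print(at_least_two)
```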


In [23]:
# here we call apply() on a Pandas Series, which is a single column.  apply() passes the
# values to the function one at a time, so s is a single value, a scalar.  We can work
# with it as we would with any Python built-in type.  (Pandas and NumPy extend Python
# with new types.)


def toUpper(s):
    if isinstance(s,str):
      return s.upper()
    return s  # leave non-string values (such as NaN) unchanged


df['city']=df['city'].apply(toUpper)
df

Unnamed: 0,name,income,education,age,city,id,email,salary,citizen
1,mno,9985.0,abc,pqr,MNO,9996.0,jkl,9973.0,N
2,pqr,10000.0,def,abc,DEF,10022.0,mno,9998.2,Y
8,def,10037.0,def,def,GHI,10038.0,pqr,9982.0,N
9,pqr,10021.0,abc,jkl,DEF,9951.0,jkl,10017.0,Y
18,ghi,10026.0,,pqr,JKL,10008.0,pqr,10006.0,Y
19,def,1000000.0,ghi,abc,DEF,9981.0,def,9996.0,Y
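
For simple string operations like this one, the `.str` accessor does the same job without a custom function. A small sketch on a hypothetical `city` Series:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
city = pd.Series(["mno", np.nan, "def"])

# .str.upper() works on the whole column and leaves NaN as NaN.
upper = city.str.upper()
print(upper)
```

A custom function with `apply` is still the better fit when the check is more involved than a single string method.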


In [24]:
# axis=1 passes each row to the function
# axis=0 passes each column to the function

# the important point to note here is that we send the entire row in as a parameter, so all the columns
# are available to use.  remember to send back the entire row after you have updated any of the columns,
# even when nothing was changed; otherwise apply() gets None for that row.

def wholeRow(row):
  if isinstance(row['city'],str):
     row['city'] = row['city'].lower()
  return row


df=df.apply(wholeRow, axis=1)
df

Unnamed: 0,name,income,education,age,city,id,email,salary,citizen
1,mno,9985.0,abc,pqr,mno,9996.0,jkl,9973.0,N
2,pqr,10000.0,def,abc,def,10022.0,mno,9998.2,Y
8,def,10037.0,def,def,ghi,10038.0,pqr,9982.0,N
9,pqr,10021.0,abc,jkl,def,9951.0,jkl,10017.0,Y
18,ghi,10026.0,,pqr,jkl,10008.0,pqr,10006.0,Y
19,def,1000000.0,ghi,abc,def,9981.0,def,9996.0,Y
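
Because the function sees the whole row, it can also compare columns with each other and return a single flag instead of a modified row. The rule below (income more than ten times salary) and the frame `sample` are hypothetical, chosen only to illustrate the idea:

```python
import pandas as pd

# Hypothetical sample: the last income looks like a typo next to its salary.
sample = pd.DataFrame({
    "income": [9985.0, 10000.0, 1000000.0],
    "salary": [9973.0, 9998.2, 9996.0],
})

def income_far_above_salary(row):
    # Hypothetical rule: income more than ten times salary is suspicious.
    return row["income"] > 10 * row["salary"]

# With axis=1 the function is called once per row and returns one value,
# so the result is a Series that can be stored as a new column.
sample["suspect"] = sample.apply(income_far_above_salary, axis=1)
print(sample)
```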


In [25]:
df

Unnamed: 0,name,income,education,age,city,id,email,salary,citizen
1,mno,9985.0,abc,pqr,mno,9996.0,jkl,9973.0,N
2,pqr,10000.0,def,abc,def,10022.0,mno,9998.2,Y
8,def,10037.0,def,def,ghi,10038.0,pqr,9982.0,N
9,pqr,10021.0,abc,jkl,def,9951.0,jkl,10017.0,Y
18,ghi,10026.0,,pqr,jkl,10008.0,pqr,10006.0,Y
19,def,1000000.0,ghi,abc,def,9981.0,def,9996.0,Y


In [26]:
'''
 Eliminate duplicates.  One way to do that is to group by all columns and count the rows in each group.  If a count is bigger than one, then delete all but one of those rows.
 '''


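
Pandas has built-in methods for this, so the grouping does not have to be written by hand. A minimal sketch on a hypothetical frame with one repeated row:

```python
import pandas as pd

# Hypothetical sample: row 2 repeats row 0 exactly.
sample = pd.DataFrame({
    "name": ["abc", "def", "abc", "abc"],
    "city": ["mno", "ghi", "mno", "jkl"],
})

# duplicated() marks every row that repeats an earlier row.
dup_mask = sample.duplicated()

# drop_duplicates() keeps the first copy of each fully identical row.
deduped = sample.drop_duplicates()

# subset limits the comparison to certain columns, here name only.
by_name = sample.drop_duplicates(subset=["name"], keep="first")

print(deduped)
print(by_name)
```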