# Data Cleaning with Pandas

In [1]:
import pandas as pd
import numpy as np

### Changing data

#### DataFrame.applymap() and Series.map()

The `.applymap()` method takes a function as input and applies it to every entry in the DataFrame.

In [2]:
import pandas as pd

uci = pd.read_csv('data/heart.csv')

In [3]:
def successor(x):
    return x + 1

In [4]:
uci.applymap(successor).head()
# Applies the function to the whole DataFrame.
# The result is not saved unless you reassign it (e.g. uci = uci.applymap(successor)).

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,64,2,4,146,234,2,1,151,1,3.3,1,1,2,2
1,38,2,3,131,251,1,2,188,1,4.5,1,1,3,2
2,42,1,2,131,205,1,1,173,1,2.4,3,1,3,2
3,57,2,2,121,237,1,2,179,1,1.8,3,1,3,2
4,58,1,1,121,355,1,2,164,2,1.6,3,1,3,2
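
The same idea on a toy frame, as a minimal sketch (the frame `toy` and its values are invented for illustration). pandas 2.1 and later rename `DataFrame.applymap()` to `DataFrame.map()`, so the sketch uses whichever one is available.

```python
import pandas as pd

def successor(x):
    return x + 1

# Toy data, invented for illustration only.
toy = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

# pandas >= 2.1 offers DataFrame.map(); older versions only have applymap().
elementwise = toy.map if hasattr(toy, 'map') else toy.applymap

bumped = elementwise(successor)  # returns a new DataFrame
print(bumped)
print(toy)                       # the original is unchanged
```

Because the call returns a new object, `toy` keeps its original values until the result is assigned back.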


The `.map()` method takes a function as input and applies it to every entry in the Series.

In [5]:
uci['age'].map(successor).tail(10)
# Applies the function to the selected column only.
# The result is not saved unless you reassign it (e.g. uci['age'] = uci['age'].map(successor)).

293    68
294    45
295    64
296    64
297    60
298    58
299    46
300    69
301    58
302    58
Name: age, dtype: int64

In [8]:
new_output = _
new_output
# A single underscore (_) refers to the most recent output.
# Two and three underscores (__ and ___) refer to the second and third most recent outputs.

0    63
1    37
2    41
3    56
Name: age, dtype: object

In [10]:
new_output2 = Out[5]
new_output2

293    68
294    45
295    64
296    64
297    60
298    58
299    46
300    69
301    58
302    58
Name: age, dtype: int64

#### Anonymous Functions (Lambda Abstraction)

Simple functions can be defined right in the function call. This is called 'lambda abstraction'; the function thus defined has no name and hence is "anonymous".

In [18]:
def round_it(x):
    return round(x)
round_it(uci['oldpeak'])[:4]

0    2.0
1    4.0
2    1.0
3    1.0
Name: oldpeak, dtype: float64

In [6]:
uci['oldpeak'].map(lambda x: round(x))[:4]
# A lambda is another way to write a simple function; functions defined this way have no name.

0    2
1    4
2    1
3    1
Name: oldpeak, dtype: int64

Exercise: Use an anonymous function to turn the entries in age to strings

In [7]:
# Your code here
uci['age'].map(lambda x: str(x))[:4]

0    63
1    37
2    41
3    56
Name: age, dtype: object

## 3. Methods for Re-Organizing DataFrames
#### `.groupby()`

Those of you familiar with SQL have probably used the GROUP BY command. Pandas has this, too.

The `.groupby()` method is especially useful for aggregate functions applied to the data grouped in particular ways.

In [11]:
uci.groupby('sex')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11e6424e0>

#### `.groups` and `.get_group()`

In [12]:
uci.groupby('sex').groups

{0: Int64Index([  2,   4,   6,  11,  14,  15,  16,  17,  19,  25,  28,  30,  35,
              36,  38,  39,  40,  43,  48,  49,  50,  53,  54,  59,  60,  65,
              67,  69,  74,  75,  82,  84,  85,  88,  89,  93,  94,  96, 102,
             105, 107, 108, 109, 110, 112, 115, 118, 119, 120, 122, 123, 124,
             125, 127, 128, 129, 130, 131, 134, 135, 136, 140, 142, 143, 144,
             146, 147, 151, 153, 154, 155, 161, 167, 181, 182, 190, 204, 207,
             213, 215, 216, 220, 223, 241, 246, 252, 258, 260, 263, 266, 278,
             289, 292, 296, 298, 302],
            dtype='int64'),
 1: Int64Index([  0,   1,   3,   5,   7,   8,   9,  10,  12,  13,
             ...
             288, 290, 291, 293, 294, 295, 297, 299, 300, 301],
            dtype='int64', length=207)}

In [20]:
uci.groupby('sex').groups.keys()

dict_keys([0, 1])

In [16]:
uci.groupby('sex').get_group(0).tail()  # .tail()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
289,55,0,0,128,205,0,2,130,1,2.0,1,1,3,0
292,58,0,0,170,225,1,0,146,1,2.8,1,2,1,0
296,63,0,0,124,197,0,1,136,1,0.0,1,0,2,0
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
302,57,0,1,130,236,0,0,174,0,0.0,1,1,2,0


### Aggregating

In [19]:
uci.groupby('sex').std()

sex,age,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,9.409396,0.972427,19.311119,65.088946,0.332455,0.55715,20.047969,0.422503,1.119844,0.593736,0.881026,0.44129,0.435286
1,8.883803,1.059064,16.658246,42.782392,0.366955,0.510754,24.130882,0.484505,1.174632,0.627378,1.074082,0.659949,0.498626
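
When different columns need different summaries, `.agg()` with named aggregations is one option. The sketch below uses a small invented frame (`toy`) rather than the heart data, and the output column names (`mean_chol`, `max_age`, `n`) are chosen here for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'sex':  [0, 0, 1, 1, 1],
    'chol': [200, 240, 180, 220, 260],
    'age':  [50, 60, 40, 55, 70],
})

# Each keyword names an output column: (input column, aggregation).
summary = toy.groupby('sex').agg(
    mean_chol=('chol', 'mean'),
    max_age=('age', 'max'),
    n=('chol', 'size'),
)
print(summary)
```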


In [21]:
uci['sex_str'] = uci.sex.map(lambda x: 'M' if x == 1 else 'F')
uci.groupby('sex_str').groups.keys()

dict_keys(['F', 'M'])

In [22]:
uci.groupby('sex_str').get_group('F').tail()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target,sex_str
289,55,0,0,128,205,0,2,130,1,2.0,1,1,3,0,F
292,58,0,0,170,225,1,0,146,1,2.8,1,2,1,0,F
296,63,0,0,124,197,0,1,136,1,0.0,1,0,2,0,F
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0,F
302,57,0,1,130,236,0,0,174,0,0.0,1,1,2,0,F


Exercise: Tell me the average cholesterol level for those with heart disease.

In [25]:
# Your code here!
uci.groupby('target').get_group(1)['chol'].mean()

242.23030303030302

In [33]:
uci.loc[uci['target'] == 1, 'chol'].mean()
# Recommended over groupby when only one group is needed: filter rows with a boolean mask.

242.23030303030302

In [32]:
uci.groupby('target').get_group(1)['chol'].sort_values().tail()
# Highest cholesterol levels among those with heart disease

4     354
39    360
96    394
28    417
85    564
Name: chol, dtype: int64

## 4. Reshaping a DataFrame

### `.pivot()`

Those of you familiar with Excel have probably used Pivot Tables. Pandas has similar functionality.

In [37]:
uci.pivot(values='sex', columns='target').head()

target,0,1
0,,1.0
1,,1.0
2,,0.0
3,,1.0
4,,0.0
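
The `NaN` entries appear because each row has only one `target` value, so the other column has nothing to show for that row. `.pivot()` is often easier to read with an explicit `index`. Below is a minimal sketch on an invented long-format table (`long_df`), not the heart data.

```python
import pandas as pd

# Toy long-format data, invented for illustration only.
long_df = pd.DataFrame({
    'patient': ['p1', 'p1', 'p2', 'p2'],
    'measure': ['chol', 'age', 'chol', 'age'],
    'value':   [233, 63, 250, 37],
})

# One row per patient, one column per measure.
wide_df = long_df.pivot(index='patient', columns='measure', values='value')
print(wide_df)
```

Note that `.pivot()` raises an error if an index/column pair occurs more than once; `pivot_table()` aggregates such duplicates instead.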


### Methods for Combining DataFrames: `.join()`, `.merge()`, `pd.concat()`, `pd.melt()`

### `.join()`

In [27]:
toy1 = pd.DataFrame([[63, 142], [33, 47]], columns=['age', 'HP'])
toy2 = pd.DataFrame([[63, 100], [33, 200]], columns=['age', 'HP'])

In [40]:
toy1

Unnamed: 0,age,HP
0,63,142
1,33,47


In [41]:
toy2

Unnamed: 0,age,HP
0,63,100
1,33,200


In [42]:
toy1.join(toy2.set_index('age'),
          on='age',
          lsuffix='_A',
          rsuffix='_B').head()

Unnamed: 0,age,HP_A,HP_B
0,63,142,100
1,33,47,200


In [47]:
toy1.set_index('age').join(toy2.set_index('age'),
          lsuffix='_A',
          rsuffix='_B').head()

age,HP_A,HP_B
63,142,100
33,47,200


### `.merge()`

In [38]:
ds_chars = pd.read_csv('data/ds_chars.csv', index_col=0)
ds_chars.head()

Unnamed: 0,name,HP,home_state
0,greg,200,WA
1,miles,200,WA
2,alan,170,TX
3,alison,300,DC
4,rachel,200,TX


In [39]:
states = pd.read_csv('data/states.csv', index_col=0)
states.head()

Unnamed: 0,state,nickname,capital
0,WA,evergreen,Olympia
1,TX,alamo,Austin
2,DC,district,Washington
3,OH,buckeye,Columbus
4,OR,beaver,Salem


In [52]:
ds_chars.merge(states,
               left_on='home_state',
               right_on='state',
                how = 'inner')

Unnamed: 0,name,HP,home_state,state,nickname,capital
0,greg,200,WA,WA,evergreen,Olympia
1,miles,200,WA,WA,evergreen,Olympia
2,alan,170,TX,TX,alamo,Austin
3,rachel,200,TX,TX,alamo,Austin
4,alison,300,DC,DC,district,Washington
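
The `how` argument controls which rows survive the merge. As a sketch on invented frames (`chars` and `states` here are small stand-ins, not the CSV files above), `how='left'` keeps unmatched rows from the left side, and `indicator=True` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Toy data, invented for illustration only.
chars = pd.DataFrame({'name': ['greg', 'alan', 'kim'],
                      'home_state': ['WA', 'TX', 'NY']})
states = pd.DataFrame({'state': ['WA', 'TX', 'OH'],
                       'capital': ['Olympia', 'Austin', 'Columbus']})

# Inner join: only rows whose key appears in both frames.
inner = chars.merge(states, left_on='home_state', right_on='state', how='inner')

# Left join: every row of `chars`, with NaN where no state matched.
left = chars.merge(states, left_on='home_state', right_on='state',
                   how='left', indicator=True)
print(inner)
print(left)
```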


### `pd.concat()`

Exercise: Look up the documentation on pd.concat (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) and use it to concatenate ds_chars and states.
<br/>
Your result should still have only five rows!

In [48]:
pd.concat([ds_chars, states], sort=False)

Unnamed: 0,name,HP,home_state,state,nickname,capital
0,greg,200.0,WA,,,
1,miles,200.0,WA,,,
2,alan,170.0,TX,,,
3,alison,300.0,DC,,,
4,rachel,200.0,TX,,,
0,,,,WA,evergreen,Olympia
1,,,,TX,alamo,Austin
2,,,,DC,district,Washington
3,,,,OH,buckeye,Columbus
4,,,,OR,beaver,Salem


In [55]:
states.columns = ['home_state', 'nickname', 'capital']
pd.concat([ds_chars, states], sort=False)

Unnamed: 0,name,HP,home_state,nickname,capital
0,greg,200.0,WA,,
1,miles,200.0,WA,,
2,alan,170.0,TX,,
3,alison,300.0,DC,,
4,rachel,200.0,TX,,
0,,,WA,evergreen,Olympia
1,,,TX,alamo,Austin
2,,,DC,district,Washington
3,,,OH,buckeye,Columbus
4,,,OR,beaver,Salem
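
Both attempts above stack the frames vertically, which is why they end up with ten rows. One way to keep the row count, sketched here on small invented frames, is to concatenate along the columns with `axis=1`. This aligns rows by index label only, so it is not a substitute for a key-based merge.

```python
import pandas as pd

# Toy data, invented for illustration only.
chars = pd.DataFrame({'name': ['greg', 'miles'], 'HP': [200, 200]})
states = pd.DataFrame({'state': ['WA', 'TX'], 'capital': ['Olympia', 'Austin']})

stacked = pd.concat([chars, states], sort=False)   # default axis=0: rows are appended
side_by_side = pd.concat([chars, states], axis=1)  # axis=1: columns are appended

print(stacked.shape)
print(side_by_side.shape)
```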


### `pd.melt()`

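Melting reshapes a DataFrame from wide to long format: the columns named in `value_vars` are stacked into two columns, `variable` (the original column name) and `value` (its entry), while the `id_vars` columns are repeated to identify each row.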

In [49]:
ds_chars.head()

Unnamed: 0,name,HP,home_state
0,greg,200,WA
1,miles,200,WA
2,alan,170,TX
3,alison,300,DC
4,rachel,200,TX


In [50]:
pd.melt(ds_chars,
        id_vars=['name'],
        value_vars=['HP', 'home_state'])

Unnamed: 0,name,variable,value
0,greg,HP,200
1,miles,HP,200
2,alan,HP,170
3,alison,HP,300
4,rachel,HP,200
5,greg,home_state,WA
6,miles,home_state,WA
7,alan,home_state,TX
8,alison,home_state,DC
9,rachel,home_state,TX


# Data Cleaning
## Scenario

As data scientists, we want to build a model to predict the sale price of a house in Seattle in 2019, based on its square footage. We know that the King County Department of Assessments has comprehensive data available on real property sales in the Seattle area. We need to prepare the data.

### First, get the data!

We'll need to download the two data files that we need. We can do this at the command line:

In [60]:
!brew install wget

touch: /usr/local/Homebrew/.git/FETCH_HEAD: Permission denied
==> Installing dependencies for wget: libunistring and libidn2
==> Installing wget dependency: libunistring
==> Downloading https://homebrew.bintray.com/bottles/libunistring-0.9.10.mojave.
==> Downloading from https://akamai.bintray.com/1d/1d0c8e266acddcebeef3d9f6162d6
######################################################################## 100.0%
==> Pouring libunistring-0.9.10.mojave.bottle.tar.gz
🍺  /usr/local/Cellar/libunistring/0.9.10: 54 files, 4.4MB
==> Installing wget dependency: libidn2
==> Downloading https://homebrew.bintray.com/bottles/libidn2-2.2.0_1.mojave.bott
==> Downloading from https://akamai.bintray.com/96/96e9b127a4123a1a4ec67f849467b
######################################################################## 100.0%
==> Po

In [1]:
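# `!cd data` has no lasting effect: each `!` command runs in its own subshell.
# Use `%cd data` to change the notebook's working directory. As written, the
# files are downloaded into the current directory.
!cd data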
!wget https://aqua.kingcounty.gov/extranet/assessor/Real%20Property%20Sales.zip
!wget https://aqua.kingcounty.gov/extranet/assessor/Residential%20Building.zip

--2019-09-04 15:33:26--  https://aqua.kingcounty.gov/extranet/assessor/Real%20Property%20Sales.zip
Resolving aqua.kingcounty.gov (aqua.kingcounty.gov)... 146.129.240.28
Connecting to aqua.kingcounty.gov (aqua.kingcounty.gov)|146.129.240.28|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 125329626 (120M) [application/x-zip-compressed]
Saving to: ‘Real Property Sales.zip.1’


2019-09-04 15:38:45 (387 KB/s) - ‘Real Property Sales.zip.1’ saved [125329626/125329626]

--2019-09-04 15:38:45--  https://aqua.kingcounty.gov/extranet/assessor/Residential%20Building.zip
Resolving aqua.kingcounty.gov (aqua.kingcounty.gov)... 146.129.240.28
Connecting to aqua.kingcounty.gov (aqua.kingcounty.gov)|146.129.240.28|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24608118 (23M) [application/x-zip-compressed]
Saving to: ‘Residential Building.zip’


2019-09-04 15:39:52 (362 KB/s) - ‘Residential Building.zip’ saved [24608118/24608118]



*Note:* If you do not have the `wget` command yet, you can install it with `brew install wget`, or use `curl -o <filename> <url>` instead.

Note that `%20` in a URL translates into a space. Even though you should *never put spaces in filenames*, you may need to deal with spaces that _other_ people have used in filenames.

There are two ways to handle the spaces in these filenames when referencing them at the command line.

#### 1. You can _escape_ the spaces by putting a backslash (`\`, remember _backslash is next to backspace_) before each one:

`unzip Real\ Property\ Sales.zip`

This is what happens if you tab-complete the filename in the terminal. Tab completion is your friend!

#### 2. You can put the entire filename in quotes:

`unzip "Real Property Sales.zip"`

Try unzipping these files with the `unzip` command. The `unzip` command takes one argument, the name of the file that you want to unzip.

In [2]:
!unzip Real\ Property\ Sales.zip
# Like `!cd data`, this does not change the notebook's working directory (`%cd ..` would).
!cd ..

Archive:  Real Property Sales.zip
  inflating: EXTR_RPSale.csv         
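
The same extraction can also be done from Python with the standard-library `zipfile` module, where a filename containing spaces is just an ordinary string and needs no escaping. The sketch below builds a throwaway archive in a temporary directory; the names `Toy Sales.zip` and `toy_sales.csv` are invented for illustration.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical archive name containing a space.
    zip_path = os.path.join(tmp, 'Toy Sales.zip')

    # Create a small archive to work with.
    with zipfile.ZipFile(zip_path, 'w') as zf:
        zf.writestr('toy_sales.csv', 'Major,Minor\n1,2\n')

    # List the members, then extract them next to the archive.
    with zipfile.ZipFile(zip_path) as zf:
        members = zf.namelist()
        zf.extractall(tmp)

    extracted = os.path.exists(os.path.join(tmp, 'toy_sales.csv'))

print(members, extracted)
```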


In [5]:
import pandas as pd
sales_df = pd.read_csv('Real Property Sales.zip')

  interactivity=interactivity, compiler=compiler, result=result)


In [6]:
sales_df.head()

Unnamed: 0,ExciseTaxNbr,Major,Minor,DocumentDate,SalePrice,RecordingNbr,Volume,Page,PlatNbr,PlatType,...,PropertyType,PrincipalUse,SaleInstrument,AFForestLand,AFCurrentUseLand,AFNonProfitUse,AFHistoricProperty,SaleReason,PropertyClass,SaleWarning
0,2687551,138860,110,08/21/2014,245000,20140828001436,,,,,...,3,6,3,N,N,N,N,1,8,
1,1235111,664885,40,07/09/1991,0,199203161090,71.0,1.0,664885.0,C,...,3,0,26,N,N,N,N,18,3,11
2,2704079,423943,50,10/11/2014,0,20141205000558,,,,,...,3,6,15,N,N,N,N,18,8,18 31 51
3,2584094,403700,715,01/04/2013,0,20130110000910,,,,,...,3,6,15,N,N,N,N,11,8,18 31 38
4,1056831,951120,900,04/20/1989,85000,198904260448,117.0,53.0,951120.0,P,...,3,0,2,N,N,N,N,1,9,49


In [7]:
sales_df.describe()

Unnamed: 0,ExciseTaxNbr,SalePrice,PropertyType,PrincipalUse,SaleInstrument,SaleReason,PropertyClass
count,2042156.0,2042156.0,2042156.0,2042156.0,2042156.0,2042156.0,2042156.0
mean,2040631.0,593355.7,3.191168,4.610459,7.345941,5.393875,6.511309
std,567580.1,6013766.0,3.962688,2.613934,6.611585,6.234883,2.482525
min,456583.0,-600.0,0.0,0.0,0.0,0.0,0.0
25%,1579805.0,0.0,3.0,2.0,3.0,1.0,3.0
50%,2052170.0,150000.0,3.0,6.0,3.0,1.0,8.0
75%,2522121.0,350000.0,3.0,6.0,15.0,10.0,8.0
max,3008118.0,739885000.0,99.0,11.0,28.0,19.0,9.0


### Seeing pink? Warnings are useful!

Note the warning above: `DtypeWarning: Columns (1, 2) have mixed types.` Because we start with an index of zero, the columns that we're being warned about are actually the _second_ and _third_ columns, `sales_df['Major']` and `sales_df['Minor']`.
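
One common way to avoid this kind of warning is to tell `read_csv` which dtype to use for the affected columns instead of letting it guess. The sketch below uses a tiny in-memory CSV (invented for illustration) and reads the parcel-number columns as text, which also preserves leading zeros.

```python
import io
import pandas as pd

# Toy CSV text, invented for illustration; note the blank Major and the leading zeros.
raw = "Major,Minor,SalePrice\n138860,0110,245000\n      ,0040,0\n"

# Default inference: Minor becomes an integer and loses its leading zeros.
default = pd.read_csv(io.StringIO(raw))

# Explicit dtypes: Major and Minor are kept as text exactly as written.
as_text = pd.read_csv(io.StringIO(raw), dtype={'Major': str, 'Minor': str})

print(default['Minor'].tolist())
print(as_text['Minor'].tolist())
```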

In [8]:
sales_df.head().T

Unnamed: 0,0,1,2,3,4
ExciseTaxNbr,2687551,1235111,2704079,2584094,1056831
Major,138860,664885,423943,403700,951120
Minor,110,40,50,715,900
DocumentDate,08/21/2014,07/09/1991,10/11/2014,01/04/2013,04/20/1989
SalePrice,245000,0,0,0,85000
RecordingNbr,20140828001436,199203161090,20141205000558,20130110000910,198904260448
Volume,,071,,,117
Page,,001,,,053
PlatNbr,,664885,,,951120
PlatType,,C,,,P


### Data overload?

That's a lot of columns. We're only interested in identifying the date, sale price, and square footage of each specific property. What can we do?

In [9]:
small_sales_df = sales_df[['Major', 'Minor', 'DocumentDate', 'SalePrice']].copy()

In [10]:
small_sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2042156 entries, 0 to 2042155
Data columns (total 4 columns):
Major           object
Minor           object
DocumentDate    object
SalePrice       int64
dtypes: int64(1), object(3)
memory usage: 62.3+ MB


In [11]:
bldg_df = pd.read_csv('Residential Building.zip')

  interactivity=interactivity, compiler=compiler, result=result)


### Another warning! Which column has index 11?

In [12]:
bldg_df.columns[11]

'ZipCode'

`ZipCode` seems like a potentially useful column. We'll need it to determine which house sales took place in Seattle.

In [13]:
bldg_df.head().T

Unnamed: 0,0,1,2,3,4
Major,12850,12850,13000,13000,13300
Minor,110,310,50,135,121
BldgNbr,1,1,1,1,1
NbrLivingUnits,1,1,1,1,1
Address,210 JUNCTION BLVD 98001,306 JUNCTION BLVD 98001,9817 38TH AVE NE 98115,3905 NE 100TH ST 98125,10013 15TH AVE S 98168
BuildingNumber,210,306,9817,3905,10013
Fraction,,,,,
DirectionPrefix,,,,NE,
StreetName,JUNCTION,JUNCTION,38TH,100TH,15TH
StreetType,BLVD,BLVD,AVE,ST,AVE


### So many features!

As data scientists, we should be _very_ cautious about discarding potentially useful data. But, today, we're interested in _only_ the total square footage of each property. What can we do?


In [14]:
small_bldg_df = bldg_df[['Major', 'Minor', 'SqFtTotLiving', 'ZipCode']].copy()

In [15]:
small_bldg_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 513808 entries, 0 to 513807
Data columns (total 4 columns):
Major            513808 non-null int64
Minor            513808 non-null int64
SqFtTotLiving    513808 non-null int64
ZipCode          468547 non-null object
dtypes: int64(3), object(1)
memory usage: 15.7+ MB


In [16]:
sales_data = pd.merge(small_sales_df, small_bldg_df, on=['Major', 'Minor'])
# Inner join (the default): keeps rows where both Major and Minor match.

In [None]:
pd.merge?

In [17]:
sales_data.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode
0,138860,110,08/21/2014,245000,1490,98002
1,138860,110,06/12/1989,109300,1490,98002
2,138860,110,01/16/2005,14684,1490,98002
3,138860,110,06/08/2005,0,1490,98002
4,423943,50,10/11/2014,0,960,98092


### Error!

Why are we seeing an error when we try to join the dataframes?

<table>
    <tr>
        <td style="text-align:left"><pre>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2013160 entries, 0 to 2013159
Data columns (total 4 columns):
Major           object
Minor           object
DocumentDate    object
SalePrice       int64
dtypes: int64(1), object(3)
memory usage: 61.4+ MB</pre></td>
        <td style="text-align:left"><pre>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 511359 entries, 0 to 511358
Data columns (total 4 columns):
Major            511359 non-null int64
Minor            511359 non-null int64
SqFtTotLiving    511359 non-null int64
ZipCode          468345 non-null object
dtypes: int64(3), object(1)
memory usage: 15.6+ MB
</pre></td>
    </tr>
</table>

Review the error message in light of the above:

* `ValueError: You are trying to merge on object and int64 columns.`
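
The situation can be reproduced on a small scale. In this sketch (toy frames invented for illustration), the key column is text on one side and `int64` on the other, so the merge is rejected until the dtypes agree.

```python
import pandas as pd

# Toy data, invented for illustration only.
left = pd.DataFrame({'key': ['1', '2'], 'price': [100, 200]})  # key stored as text
right = pd.DataFrame({'key': [1, 2], 'sqft': [900, 1200]})     # key stored as int64

try:
    left.merge(right, on='key')
    merge_failed = False
except ValueError as err:
    merge_failed = True
    print(err)

# Convert the text key to integers, then merge again.
left['key'] = left['key'].astype('int64')
merged = left.merge(right, on='key')
print(merged)
```

`astype('int64')` only works here because every toy value is a clean number; values that cannot be parsed need a more forgiving conversion.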

In [18]:
pd.to_numeric(sales_df['Major'])

ValueError: Unable to parse string "      " at position 934952

### Error!

Note the useful error message above:

`ValueError: Unable to parse string "      " at position 934952`

In this case, we want to treat non-numeric values as missing values. Let's see if there's a way to change how the `pd.to_numeric` function handles errors.

In [19]:
# The single question mark means "show me the docstring"
pd.to_numeric?

Here's the part that we're looking for:
```
errors : {'ignore', 'raise', 'coerce'}, default 'raise'
    - If 'raise', then invalid parsing will raise an exception
    - If 'coerce', then invalid parsing will be set as NaN
    - If 'ignore', then invalid parsing will return the input
```

Let's try setting the `errors` parameter to `'coerce'`.
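
Before running this on two million rows, here is the behavior on a three-element toy Series (values invented for illustration). The final `astype('Int64')` step is an optional extra, not part of the lesson: it uses pandas' nullable integer dtype so the column can stay integer-valued while still holding missing entries.

```python
import pandas as pd

# Toy data, invented for illustration; the middle entry is all whitespace.
raw = pd.Series(['138860', '      ', '423943'])

# Unparseable entries become NaN, which forces the result to float64.
coerced = pd.to_numeric(raw, errors='coerce')
print(coerced)

# Optional: nullable integers keep whole numbers and mark the gap as <NA>.
as_int = coerced.astype('Int64')
print(as_int)
```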

In [20]:
small_sales_df['Major'] = pd.to_numeric(sales_df['Major'], errors='coerce')

Did it work?

In [21]:
small_sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2042156 entries, 0 to 2042155
Data columns (total 4 columns):
Major           float64
Minor           object
DocumentDate    object
SalePrice       int64
dtypes: float64(1), int64(1), object(2)
memory usage: 62.3+ MB


It worked! Let's do the same thing with the `Minor` parcel number.

In [22]:
small_sales_df['Minor'] = pd.to_numeric(sales_df['Minor'], errors='coerce')

In [23]:
small_sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2042156 entries, 0 to 2042155
Data columns (total 4 columns):
Major           float64
Minor           float64
DocumentDate    object
SalePrice       int64
dtypes: float64(2), int64(1), object(1)
memory usage: 62.3+ MB


Now, let's try our join again.

In [24]:
sales_data = pd.merge(small_sales_df, small_bldg_df, on=['Major', 'Minor'])

In [25]:
sales_data.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode
0,138860.0,110.0,08/21/2014,245000,1490,98002
1,138860.0,110.0,06/12/1989,109300,1490,98002
2,138860.0,110.0,01/16/2005,14684,1490,98002
3,138860.0,110.0,06/08/2005,0,1490,98002
4,423943.0,50.0,10/11/2014,0,960,98092


In [26]:
sales_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1459095 entries, 0 to 1459094
Data columns (total 6 columns):
Major            1459095 non-null float64
Minor            1459095 non-null float64
DocumentDate     1459095 non-null object
SalePrice        1459095 non-null int64
SqFtTotLiving    1459095 non-null int64
ZipCode          1338134 non-null object
dtypes: float64(2), int64(2), object(2)
memory usage: 77.9+ MB


We can see right away that we're missing zip codes for many of the sales transactions. (1338134 non-null entries for ZipCode is fewer than the 1459095 entries in the DataFrame.)

In [27]:
sales_data.loc[sales_data['ZipCode'].isna()].head(20)

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode
242,334330.0,1343.0,05/30/2006,0,4600,
243,334330.0,1343.0,05/30/2006,0,4600,
244,334330.0,1343.0,11/26/2001,0,4600,
245,334330.0,1343.0,05/30/2006,0,4600,
246,334330.0,1343.0,06/30/2016,0,4600,
247,334330.0,1343.0,05/30/2006,0,4600,
248,334330.0,1343.0,08/18/2005,0,4600,
249,334330.0,1343.0,06/07/2007,915000,4600,
250,334330.0,1343.0,11/26/2001,403000,4600,
251,334330.0,1343.0,11/20/2014,880000,4600,


Because we are interested in finding houses in Seattle zip codes, we will need to drop the rows with missing zip codes.

In [34]:
sales_data = sales_data.loc[~sales_data['ZipCode'].isna(), :]
# ~ negates a boolean mask, so ~sales_data['ZipCode'].isna() selects the rows whose ZipCode is not missing.

In [35]:
sales_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1338134 entries, 0 to 1459094
Data columns (total 6 columns):
Major            1338134 non-null float64
Minor            1338134 non-null float64
DocumentDate     1338134 non-null object
SalePrice        1338134 non-null int64
SqFtTotLiving    1338134 non-null int64
ZipCode          1338134 non-null object
dtypes: float64(2), int64(2), object(2)
memory usage: 71.5+ MB
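
An equivalent and arguably more readable way to drop rows with a missing value in one column is `dropna(subset=...)`. A minimal sketch on an invented three-row frame:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({'SalePrice': [100, 200, 300],
                    'ZipCode': ['98002', np.nan, '98118']})

by_mask = toy.loc[~toy['ZipCode'].isna(), :]   # boolean-mask version
by_dropna = toy.dropna(subset=['ZipCode'])     # same rows, stated more directly

print(by_dropna)
```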


# Your turn: Data Cleaning with Pandas

### 1. Investigate and drop rows with invalid values in the SalePrice and SqFtTotLiving columns.

Use multiple notebook cells to accomplish this! Press `[esc]` then `B` to create a new cell below the current cell. Press `[return]` to start typing in the new cell.

In [36]:
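# Each comparison needs its own parentheses, because & binds more tightly than >.
# .copy() avoids a SettingWithCopyWarning when columns are added later.
sales_data = sales_data.loc[(sales_data['SalePrice'] > 0) & (sales_data['SqFtTotLiving'] > 0)].copy()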
sales_data.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode
221,868228.0,510.0,10/19/2005,394995,1365,98053
222,868228.0,510.0,08/18/2006,425000,1365,98053
223,868228.0,510.0,10/07/2004,13014683,1365,98053
479,172204.0,9005.0,12/21/1999,562500,4623,98198
605,935840.0,210.0,06/10/2005,2025000,4311,98003
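
A caution when combining conditions: in Python, `&` has higher precedence than `>`, so `(a > 0) & b > 0` is read as `((a > 0) & b) > 0`. That expression can run without an error and still select the wrong rows. Wrapping every comparison in its own parentheses avoids the problem, as in this sketch on an invented frame:

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({'SalePrice':     [0, 100, 200, 300],
                    'SqFtTotLiving': [900, 0, 1200, 1365]})

# Both comparisons are parenthesized before being combined with &.
valid = toy.loc[(toy['SalePrice'] > 0) & (toy['SqFtTotLiving'] > 0)]
print(valid)
```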


### 2. Investigate and handle non-numeric ZipCode values

Can you find a way to shorten ZIP+4 codes to the first five digits?

What's the right thing to do with missing values?

In [38]:
# Read the error message and decide how to fix it.
# Note: using errors='coerce' is the *wrong* choice in this case.
def is_integer(x):
    try:
        _ = int(x)
    except ValueError:
        return False
    return True

pd.to_numeric(sales_data['ZipCode']).head()
# sales_data.ZipCode.value_counts()

221    98053.0
222    98053.0
223    98053.0
479    98198.0
605    98003.0
Name: ZipCode, dtype: float64
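
For the ZIP+4 question, one possible approach is to treat zip codes as text and keep the first five characters. The sketch below is only that, a sketch: the sample values are invented, and it assumes every non-missing entry is already a string (numeric entries would need converting to text first).

```python
import pandas as pd

# Toy data, invented for illustration: plain, ZIP+4, padded, and missing.
zips = pd.Series(['98053', '98198-1234', '98003 ', None])

# Strip stray whitespace, then keep the first five characters.
zip5 = zips.str.strip().str.slice(0, 5)
print(zip5)
```

Missing values stay missing, so the decision about what to do with them can be made separately.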

### 3. Add a column for PricePerSqFt



In [39]:
sales_data['PricePerSqFt'] = sales_data.SalePrice / sales_data.SqFtTotLiving

In [40]:
sales_data.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode,PricePerSqFt
221,868228.0,510.0,10/19/2005,394995,1365,98053,289.373626
222,868228.0,510.0,08/18/2006,425000,1365,98053,311.355311
223,868228.0,510.0,10/07/2004,13014683,1365,98053,9534.5663
479,172204.0,9005.0,12/21/1999,562500,4623,98198,121.674238
605,935840.0,210.0,06/10/2005,2025000,4311,98003,469.728601


In [41]:
sales_data['PricePerSqFt'] = sales_data.PricePerSqFt.map(lambda x: round(x, 2))
sales_data.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode,PricePerSqFt
221,868228.0,510.0,10/19/2005,394995,1365,98053,289.37
222,868228.0,510.0,08/18/2006,425000,1365,98053,311.36
223,868228.0,510.0,10/07/2004,13014683,1365,98053,9534.57
479,172204.0,9005.0,12/21/1999,562500,4623,98198,121.67
605,935840.0,210.0,06/10/2005,2025000,4311,98003,469.73


### 4. Subset the data to 2019 sales only.

We can assume that the DocumentDate is approximately the sale date.

In [43]:
sales_data.loc[sales_data['DocumentDate'].str.endswith('2019')].head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,SqFtTotLiving,ZipCode,PricePerSqFt
5272,152104.0,9136.0,01/24/2019,268000,1551,98001,172.79
13688,258790.0,90.0,08/05/2019,725000,3253,98042,222.87
27168,763240.0,285.0,08/17/2019,644000,1527,98166,421.74
29054,60300.0,570.0,06/20/2019,407000,1125,98118,361.78
31643,5330.0,310.0,03/01/2019,667950,2525,98058,264.53


In [48]:
pd.to_datetime(sales_data['DocumentDate']).dt.day.head()

221    19
222    18
223     7
479    21
605    10
Name: DocumentDate, dtype: int64
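
Once the dates are parsed, the `.dt` accessor also exposes the year, which gives an alternative to string matching. A minimal sketch on invented rows, assuming the `MM/DD/YYYY` format seen above:

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({'DocumentDate': ['01/24/2019', '10/19/2005', '08/05/2019'],
                    'SalePrice': [268000, 394995, 725000]})

# An explicit format avoids ambiguity between month-first and day-first dates.
dates = pd.to_datetime(toy['DocumentDate'], format='%m/%d/%Y')

sales_2019 = toy.loc[dates.dt.year == 2019]
print(sales_2019)
```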

### 5. Subset the data to zip codes within the City of Seattle.

You'll need to find a list of Seattle zip codes!

### 6. What is the mean price per square foot for a house sold in Seattle in 2019?

Don't just type the answer. Type code that generates the answer as output!
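
As a starting point for steps 5 and 6, the general pattern is `.isin()` followed by `.mean()`. Everything in this sketch is invented for illustration: the rows are toy data, and `seattle_zips` is a short placeholder list, not a complete or authoritative list of Seattle zip codes.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'ZipCode': ['98118', '98001', '98101', '98118'],
    'PricePerSqFt': [360.0, 170.0, 500.0, 400.0],
})

# Placeholder list: replace it with a full list of Seattle zip codes.
seattle_zips = ['98101', '98118']

in_seattle = toy.loc[toy['ZipCode'].isin(seattle_zips)]
mean_ppsf = in_seattle['PricePerSqFt'].mean()
print(mean_ppsf)
```

The comparison only works if the zip codes in the data and in the list share a type, so make sure both are strings (or both integers) before filtering.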