# Week1_3_Data Management_2

Not all datasets are clean and ready to use. In fact, a perfectly clean dataset is very rare. When we analyze real-world data, we run into a range of issues, such as missing data and inconsistent data types. Today, we will look at several common issues in data management and learn how to deal with them.

We will continue to use the 2012 Beijing resale housing dataset as an example.

In [10]:
# We are going to start by importing the libraries we need.
# It is good practice to keep all the imports in one cell so that
# we can easily see which libraries the notebook uses.

import pandas as pd

# 1. Data cleaning
As you might have already seen, when we work with data, the initial dataset is not always in a shape where we can use it as is. 

Sometimes column names are misspelled or unclear, there may be missing values, or the format of a column is incorrect. Moreover, you may have noticed that we can often extract information from columns in a way that makes them easier to work with. All these steps are part of the data cleaning process, where we get the dataset ready to be used more effectively for our analysis.


## 1.1 Getting the data

In [51]:
# load the housing dataset using relative path
df_2012 = pd.read_excel("HouseBeijing2012.xlsx")

In [52]:
# check the column names
df_2012.columns

Index(['HouseID', 'CommunityID', 'TotalPrice', 'TransYear', 'Bedroom',
       'Livingroom', 'Bathroom', 'Size', 'FloorLevel', 'WinSouth',
       'WinSouthNorth', 'Decoration', 'TotalFloor', 'BuiltYear', 'Elevation',
       'Heating', 'TransMonth', 'TransDay', 'District', 'DistName',
       'CensusTract', 'XIAOQUWEB', 'SchQuality', 'NumSubway1km', 'Dist2Subway',
       'HospQuality', 'Dist2Hosp', 'NumHosp1km', 'NumBus200m', 'Dist2CBD',
       'Dist2Center'],
      dtype='object')

An interesting question for city planners and urban economists is how homebuyers value the relative height of a unit within its building.
   - High-floor dwelling units offer many benefits, such as reduced exposure to traffic pollution and noise, lower security risks, scenic views, and the emotional appeal of being higher up relative to others.
   - However, the disadvantages are also apparent, including vertical commuting costs and the potential hazards of top floors in case of rain leakage, fire, or other emergencies.
   
So, now we want to explore the relationship between the relative floor level and unit housing price.

In the dataset, we have a column named `FloorLevel`, which should have five levels, coded from 1 to 5. These numbers mean:
- 1: the ground floor - the very first floor of a building
- 2: the low floor - the bottom third of a building's total number of floors
- 3: the middle floor - the middle third of a building's total number of floors
- 4: the high floor - the top third of a building's total number of floors
- 5: the top floor - the highest floor

Now let us check this column:

In [53]:
df_2012[["FloorLevel"]]

Unnamed: 0,FloorLevel
0,bottom floor
1,
2,3
3,3
4,4
...,...
4995,bottom floor
4996,3
4997,3
4998,2


There are two issues in this column:
   - It contains `NaN` values (row 1, for example), which means the value is missing.
   - Besides numbers, it contains string (text) values, **top floor** and **bottom floor**, which should be given the values 5 and 1, respectively.

## 1.2 Assessing Data Types
We have not looked at the other columns yet. So, one of the next things we'll check is the data type of each column, to make sure that each one is in the right format.

In [54]:
df_2012.dtypes

HouseID           object
CommunityID        int64
TotalPrice       float64
TransYear          int64
Bedroom            int64
Livingroom         int64
Bathroom           int64
Size             float64
FloorLevel        object
WinSouth           int64
WinSouthNorth      int64
Decoration         int64
TotalFloor         int64
BuiltYear          int64
Elevation          int64
Heating            int64
TransMonth         int64
TransDay           int64
District           int64
DistName          object
CensusTract        int64
XIAOQUWEB         object
SchQuality         int64
NumSubway1km       int64
Dist2Subway      float64
HospQuality        int64
Dist2Hosp        float64
NumHosp1km         int64
NumBus200m         int64
Dist2CBD         float64
Dist2Center      float64
dtype: object

The other columns look okay. Even if some columns looked problematic, I would not necessarily change the data types for all of them (especially when there are a lot), **just the ones that you might potentially need**.

For now, it seems the only problematic column is `FloorLevel`.
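
With 31 columns, scanning the `dtypes` output by eye is manageable, but on wider tables it can help to list only the non-numeric columns. The sketch below is one possible way to do that with `select_dtypes()`. The small `toy` frame is a made-up stand-in for the housing data, not the real file.

```python
import pandas as pd

# Hypothetical toy frame that mimics the mix of types in the housing data.
toy = pd.DataFrame({
    "HouseID": ["A1", "A2", "A3"],
    "Size": [69.68, 88.83, 98.69],
    "FloorLevel": ["bottom floor", "3", "top floor"],
})

# Keep only the columns that pandas does not treat as numeric.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```

On the real data you would replace `toy` with `df_2012`. Columns such as `HouseID` and `DistName` are expected to be text, so only the unexpected entries need a closer look.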

## 1.3 Replacing Data

Let us first check the column in more detail:
   - `unique()`: obtain the unique values
   - `value_counts()`: check how many times each unique value appears

In [55]:
df_2012['FloorLevel'].unique()

array(['bottom floor', nan, '3', '4', '2', 'top floor'], dtype=object)

In [56]:
df_2012['FloorLevel'].value_counts(dropna=False)

3               1839
4               1139
2               1071
top floor        590
bottom floor     358
NaN                3
Name: FloorLevel, dtype: int64

As we can see, 590 "top floor" observations need to be assigned the value 5, and 358 "bottom floor" observations need to be assigned the value 1. There are also three `NaN` values, which we can either drop or fill with some value.

Let us deal with the string values first. 
   - replace "top floor" with value 5
   - replace "bottom floor" with value 1
   
   

We went over replacing data last week. There are actually a few ways to do this: 
   - `df.replace(to_replace=old_value, value=new_value)`

In [57]:
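## Assign the result back to the column. Calling replace(..., inplace=True) on a
## selected column is a chained in-place operation: recent pandas versions warn
## about it, and it may leave df_2012 unchanged.
df_2012['FloorLevel'] = df_2012['FloorLevel'].replace({'top floor': '5', 'bottom floor': '1'})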

In [58]:
# Let us check the data again: 
df_2012['FloorLevel'].unique()

array(['1', nan, '3', '4', '2', '5'], dtype=object)
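
To see what `replace()` does in isolation, here is a minimal sketch on a made-up Series (the values are illustrative, not taken from the housing file). It passes a dictionary so that both labels are recoded in one call, and it shows that `replace()` returns a new Series and leaves the original untouched unless you assign the result back.

```python
import pandas as pd

# Hypothetical toy column with the same kinds of values as FloorLevel.
floor = pd.Series(["bottom floor", "3", "top floor", "2"], name="FloorLevel")

# Map each text label to the string code it should become.
mapping = {"top floor": "5", "bottom floor": "1"}

# replace() returns a new Series; `floor` itself is not modified.
floor_clean = floor.replace(mapping)

print(floor.tolist())
print(floor_clean.tolist())
```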

We can also use 
   - `df.loc[df['column_name'] == some_value, 'column_name'] = new_value`

In [None]:
# Use the string '5' here so that the result matches the replace() version above
# df_2012.loc[df_2012['FloorLevel'] == 'top floor', 'FloorLevel'] = '5'
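
The `.loc` pattern can be tried safely on a small made-up frame first. The sketch below assumes a hypothetical `toy` DataFrame. It builds a boolean mask, then uses the mask to pick the rows and the column name to pick the cell that gets the new value.

```python
import pandas as pd

# Hypothetical toy frame, not the real housing data.
toy = pd.DataFrame({
    "FloorLevel": ["top floor", "3", "bottom floor"],
    "Size": [50.0, 60.0, 70.0],
})

# Boolean mask: True for the rows we want to change.
is_top = toy["FloorLevel"] == "top floor"

# .loc[rows, column] = value changes only the selected cells, in place.
toy.loc[is_top, "FloorLevel"] = "5"
toy.loc[toy["FloorLevel"] == "bottom floor", "FloorLevel"] = "1"

print(toy["FloorLevel"].tolist())
```

Unlike `replace()`, this assignment modifies `toy` directly, so it is worth running it on a copy if you want to keep the original values.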

## 1.4 Null values in pandas

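There are two main ways to represent a missing value in a cell in pandas:
- `None` is Python's built-in "no value" object. It is not a numeric type, so pandas keeps it as-is only in `object` columns.
- `NaN` ("not a number") is a special floating-point value that pandas uses as its default marker for missing data. It appears in numeric columns and, as with `FloorLevel` here, in `object` columns too.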
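
The difference between the two markers rarely matters in practice, because `isna()` treats both as missing. The sketch below uses a made-up `object` Series to illustrate this. It also shows why an ordinary `==` comparison cannot be used to find `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical object column holding both kinds of missing marker.
s = pd.Series(["1", None, np.nan, "3"], dtype=object)

# isna() flags None and NaN alike.
mask = s.isna()
n_missing = int(mask.sum())

# NaN is never equal to anything, not even to itself,
# which is why we test with isna() instead of ==.
nan_equals_itself = (np.nan == np.nan)

print(mask.tolist())
print(n_missing)
print(nan_equals_itself)
```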

There are a few ways of handling missing data.


### 1.4.1 Removing rows 

We can remove those rows with data missing from a column that we are planning to use in our analysis. 

Here we are going to use the `isna()` function to check if the `FloorLevel` column has a `NaN`
   - `isna()` returns a boolean (True or False) for each row and we are going to use that boolean to filter the dataframe.

In [59]:
# check the rows with NaN in FloorLevel
# (isna() already returns True/False, so no "== True" is needed)
df_2012[df_2012['FloorLevel'].isna()]

Unnamed: 0,HouseID,CommunityID,TotalPrice,TransYear,Bedroom,Livingroom,Bathroom,Size,FloorLevel,WinSouth,...,XIAOQUWEB,SchQuality,NumSubway1km,Dist2Subway,HospQuality,Dist2Hosp,NumHosp1km,NumBus200m,Dist2CBD,Dist2Center
1,BJCP84958845,2606,1800066.0,2012,3,2,2,129.0,,0,...,https://bj.lianjia.com/xiaoqu/1111027380050/,0,0,2284.0939,9,9154.80958,0,0,18298.50637,18632.22305
5,BJFT85228189,2768,1280012.6,2012,1,1,1,46.1,,1,...,https://bj.lianjia.com/xiaoqu/1111027380520/,0,0,1648.56576,8,1249.75505,0,1,16997.95563,11497.33703
14,BJHD84911905,948,1950030.0,2012,2,1,1,56.4,,1,...,https://bj.lianjia.com/xiaoqu/1111027376013/,1,2,553.71836,9,606.02436,1,0,14631.09681,9240.23407


In [60]:
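# We are going to keep only the rows where the FloorLevel column is not NaN.
# .copy() makes the result an independent DataFrame, so later column
# assignments do not trigger a SettingWithCopyWarning.
df_2012 = df_2012[df_2012['FloorLevel'].notna()].copy()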
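
pandas also offers `dropna()` for the same job. The sketch below, on a hypothetical `toy` frame, suggests that filtering with a boolean mask and calling `dropna(subset=...)` give the same rows. The `subset` argument matters: without it, `dropna()` would drop a row with a missing value in any column, including columns we do not plan to use.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values in two different columns.
toy = pd.DataFrame({
    "HouseID": ["A1", "A2", "A3", "A4"],
    "FloorLevel": ["1", np.nan, "3", "5"],
    "Size": [69.68, 129.0, np.nan, 58.12],
})

# Option 1: boolean mask, keeping rows where FloorLevel is present.
by_mask = toy[toy["FloorLevel"].notna()]

# Option 2: dropna restricted to the column we care about.
by_dropna = toy.dropna(subset=["FloorLevel"])

# Without subset, the row with a missing Size is dropped as well.
drop_any = toy.dropna()

print(by_dropna.index.tolist())
print(drop_any.index.tolist())
```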

### 1.4.2 Replacing missing data

Instead of dropping rows, we can also replace the missing data with a chosen value. (We already dropped the three rows with a missing `FloorLevel` above, so the cells below leave our data unchanged; they show what you would run if you wanted to keep those rows.)
- For numerical values, we can use the mean of the non-NaN values in the column. (For instance, if our column were something like "adult heights", replacing the NaN values with the column mean would leave the sample mean unchanged, which might be good for regression purposes.)
- We can also use the median, if you think there are outliers in the sample that might be skewing the mean.
- Replacing with the mode (most frequent value) makes more sense if we think that there is some default value.
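
The three options can be compared side by side on a small example. The sketch below uses a made-up column of heights (the numbers, including the 250 outlier, are invented for illustration) and fills the one missing value with the mean, the median, and the mode in turn.

```python
import numpy as np
import pandas as pd

# Hypothetical heights in cm; 250.0 is a deliberate outlier, NaN is missing.
heights = pd.Series([160.0, 165.0, 165.0, 170.0, 250.0, np.nan])

mean_value = heights.mean()        # NaN is skipped by default
median_value = heights.median()
mode_value = heights.mode()[0]     # mode() returns a Series; take the first entry

filled_mean = heights.fillna(mean_value)
filled_median = heights.fillna(median_value)
filled_mode = heights.fillna(mode_value)

print(mean_value, median_value, mode_value)
# Filling with the mean keeps the sample mean the same as before.
print(filled_mean.mean())
```

Here the outlier pulls the mean well above the median, which is the situation where the median is usually the safer fill value.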

**What would you do here?**

In [61]:
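# This gets the mode of the FloorLevel column.
# mode() returns a Series (there can be ties), so we take the first entry with [0]
# to get a single value that fillna() can apply to every missing row.
mode_FloorLevel = df_2012['FloorLevel'].mode()[0]
mode_FloorLevel

'3'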

In [62]:
# This fills the NaNs with the mode using the fillna() function
# fillna() is a method that fills in missing values with a value of your choice

df_2012['FloorLevel'].fillna(mode_FloorLevel)

0       1
2       3
3       3
4       4
6       2
       ..
4995    1
4996    3
4997    3
4998    2
4999    5
Name: FloorLevel, Length: 4997, dtype: object

In [63]:
# Now write over the old FloorLevel column with the new one
df_2012['FloorLevel'] = df_2012['FloorLevel'].fillna(mode_FloorLevel)
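
One detail is easy to miss: `mode()` returns a Series, because a column can have several equally frequent values. When a Series is passed to `fillna()`, pandas matches fill values to rows by index label, so only rows whose labels appear in that Series get filled. The sketch below uses a hypothetical toy column (not the housing data) to show the difference between passing the whole Series and passing its first element.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with missing values at index 1 and 3.
floor = pd.Series(["3", np.nan, "3", np.nan, "2"], dtype=object)

# mode() gives a one-row Series whose only index label is 0.
mode_series = floor.mode()

# Passing the Series: fill values are aligned on index labels,
# and there is no fill value for labels 1 and 3, so both stay missing.
filled_with_series = floor.fillna(mode_series)

# Passing a single value: every missing row is filled.
filled_with_scalar = floor.fillna(mode_series[0])

print(int(filled_with_series.isna().sum()))
print(int(filled_with_scalar.isna().sum()))
print(filled_with_scalar.tolist())
```

A quick check such as `df_2012['FloorLevel'].isna().sum()` after filling is a cheap way to confirm that no missing values remain.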

Let us check the data type of `FloorLevel`:

In [64]:
df_2012.dtypes

HouseID           object
CommunityID        int64
TotalPrice       float64
TransYear          int64
Bedroom            int64
Livingroom         int64
Bathroom           int64
Size             float64
FloorLevel        object
WinSouth           int64
WinSouthNorth      int64
Decoration         int64
TotalFloor         int64
BuiltYear          int64
Elevation          int64
Heating            int64
TransMonth         int64
TransDay           int64
District           int64
DistName          object
CensusTract        int64
XIAOQUWEB         object
SchQuality         int64
NumSubway1km       int64
Dist2Subway      float64
HospQuality        int64
Dist2Hosp        float64
NumHosp1km         int64
NumBus200m         int64
Dist2CBD         float64
Dist2Center      float64
dtype: object

## 1.5 Changing data types
Notice that every value in `FloorLevel` now looks like a number, but the column is still showing up as an `object`. That is because the values are stored as strings such as `'3'`, not as numbers. Now let's change the data type of `FloorLevel`.

- `.astype()` converts a column to the data type you specify.


In [65]:
## What I've done here is replace the old `FloorLevel` column with 
## a version of it that is an int
df_2012['FloorLevel'] = df_2012['FloorLevel'].astype(int)
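
`astype(int)` only works when every value can be read as an integer, so a single leftover text label or `NaN` makes it fail. A more forgiving alternative, not used in this notebook, is `pd.to_numeric()` with `errors="coerce"`, which turns anything it cannot parse into `NaN`. The sketch below assumes a made-up column containing one stray label, `"unknown"`.

```python
import pandas as pd

# Hypothetical column with one value that is not a number.
floor = pd.Series(["1", "3", "5", "unknown"], dtype=object)

# astype(int) raises an error as soon as it meets the stray label.
try:
    floor.astype(int)
    astype_failed = False
except ValueError as err:
    astype_failed = True
    print("astype(int) failed:", err)

# to_numeric with errors="coerce" converts what it can and marks the rest as NaN.
as_numbers = pd.to_numeric(floor, errors="coerce")
print(as_numbers.tolist())

# After handling the NaN (here: dropping it), the column can become integer.
clean = as_numbers.dropna().astype(int)
print(clean.tolist())
```

Coercing hides the problem values rather than fixing them, so it is worth counting the new `NaN` values before deciding how to handle them.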

In [75]:
df_2012.dtypes

HouseID           object
CommunityID        int64
TotalPrice       float64
TransYear          int64
Bedroom            int64
Livingroom         int64
Bathroom           int64
Size             float64
FloorLevel         int32
WinSouth           int64
WinSouthNorth      int64
Decoration         int64
TotalFloor         int64
BuiltYear          int64
Elevation          int64
Heating            int64
TransMonth         int64
TransDay           int64
District           int64
DistName          object
CensusTract        int64
XIAOQUWEB         object
SchQuality         int64
NumSubway1km       int64
Dist2Subway      float64
HospQuality        int64
Dist2Hosp        float64
NumHosp1km         int64
NumBus200m         int64
Dist2CBD         float64
Dist2Center      float64
UnitPrice        float64
dtype: object

In [78]:
df_2012

Unnamed: 0,HouseID,CommunityID,TotalPrice,TransYear,Bedroom,Livingroom,Bathroom,Size,FloorLevel,WinSouth,...,SchQuality,NumSubway1km,Dist2Subway,HospQuality,Dist2Hosp,NumHosp1km,NumBus200m,Dist2CBD,Dist2Center,UnitPrice
0,BJFT84326414,1544,1400010.56,2012,2,1,1,69.68,1,0,...,0,2,633.24007,9,1803.02071,0,1,9345.20091,7396.31505,20092.0
2,BJDX84905788,2264,1350038.34,2012,2,1,1,88.83,3,1,...,0,1,667.21572,8,11158.05983,0,4,22480.82065,20105.06770,15198.0
3,BJFT00386624,3621,1800006.91,2012,2,1,1,98.69,3,0,...,0,1,939.29061,9,1698.79101,0,10,16309.85203,11427.48851,18239.0
4,BJCY84713854,1127,1970019.58,2012,1,1,1,53.66,4,0,...,0,3,476.28267,9,938.35742,2,0,8105.90581,7213.87518,36713.0
6,BJCY84112518,2767,2730093.60,2012,3,2,2,132.08,2,1,...,0,1,692.53062,9,5122.56097,0,0,11524.68492,16967.87510,20670.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,BJTJ84718789,3360,1030038.24,2012,2,1,1,81.84,1,1,...,0,0,1669.04965,9,948.21718,1,0,15849.26185,21363.35490,12586.0
4996,BJFT84287006,736,1300028.16,2012,1,1,1,58.12,3,1,...,0,1,770.89444,9,777.47225,1,1,6221.44255,6278.89885,22368.0
4997,BJDC84781079,200,2190056.96,2012,2,1,1,58.24,3,1,...,2,1,715.50723,9,2053.92372,0,0,3779.07141,3505.28158,37604.0
4998,BJSJ85075781,2133,1500007.54,2012,1,1,1,63.86,2,1,...,0,1,863.10517,9,413.21398,2,2,21094.05154,15576.85068,23489.0


In [79]:
# Export the dataset
df_2012.to_excel("E:/CRP_3850_summer2024/Week1_3_DataManagement/HouseBeijing2012_clean.xlsx")

## 1.6 Does relative floor level matter?

In [68]:
# First calculate the UnitPrice. UnitPrice = TotalPrice/Size
df_2012["UnitPrice"] = df_2012["TotalPrice"] / df_2012["Size"]

In [69]:
# Select UnitPrice before averaging: in pandas 2.0 and later, mean() over all
# columns raises an error on text columns such as HouseID.
df_2012.groupby("FloorLevel")[["UnitPrice"]].mean()

FloorLevel,UnitPrice
1,25564.284916
2,25485.943044
3,24902.871669
4,24908.994732
5,23542.227119
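
A group mean on its own says little about how many sales sit behind it or whether a few expensive units drive it. As a possible extension, the sketch below computes the unit price and then reports the mean, median, and count per floor level in one call with `agg()`. The `toy` frame and its prices are invented for illustration and are not rows from the housing data.

```python
import pandas as pd

# Hypothetical toy sales, not the real housing data.
toy = pd.DataFrame({
    "FloorLevel": [1, 1, 3, 3, 5],
    "TotalPrice": [1_400_000.0, 1_000_000.0, 1_800_000.0, 1_200_000.0, 900_000.0],
    "Size": [70.0, 50.0, 90.0, 60.0, 50.0],
})

# Unit price = total price divided by floor area.
toy["UnitPrice"] = toy["TotalPrice"] / toy["Size"]

# Several summary statistics per floor level at once.
summary = toy.groupby("FloorLevel")["UnitPrice"].agg(["mean", "median", "count"])
print(summary)
```

On the real data, the same call with `df_2012` in place of `toy` would show the group sizes next to the averages.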


In [70]:
df_2012.columns

Index(['HouseID', 'CommunityID', 'TotalPrice', 'TransYear', 'Bedroom',
       'Livingroom', 'Bathroom', 'Size', 'FloorLevel', 'WinSouth',
       'WinSouthNorth', 'Decoration', 'TotalFloor', 'BuiltYear', 'Elevation',
       'Heating', 'TransMonth', 'TransDay', 'District', 'DistName',
       'CensusTract', 'XIAOQUWEB', 'SchQuality', 'NumSubway1km', 'Dist2Subway',
       'HospQuality', 'Dist2Hosp', 'NumHosp1km', 'NumBus200m', 'Dist2CBD',
       'Dist2Center', 'UnitPrice'],
      dtype='object')

