## Housing price in Beijing
Housing prices in Beijing from 2011 to 2017, fetched from Lianjia.com.

Column information:

- **url** : the URL from which the data was fetched
- **id** : the id of the transaction
- **Lng** and **Lat** : coordinates, using the BD09 protocol
- **Cid** : community id
- **tradeTime** : the time of the transaction
- **DOM** : active days on market. See https://en.wikipedia.org/wiki/Days_on_market
- **followers** : the number of people following the transaction
- **totalPrice** : the total price
- **square** : the area of the house in m2
- **livingRoom** : the number of living rooms
- **drawingRoom** : the number of drawing rooms
- **kitchen** : the number of kitchens
- **bathRoom** : the number of bathrooms
- **floor** : the height of the house. I will turn the Chinese characters to English in the next version.
- **buildingType** : one of tower (1), bungalow (2), combination of plate and tower (3), plate (4)
- **constructionTime** : the time of construction
- **renovationCondition** : one of other (1), rough (2), Simplicity (3), hardcover (4)
- **buildingStructure** : one of unknown (1), mixed (2), brick and wood (3), brick and concrete (4), steel (5), steel-concrete composite (6)
- **ladderRatio** : the ratio between the number of residents on the same floor and the number of elevators (ladders). It describes how many elevators a resident has on average.
- **elevator** : has an elevator (1) or not (0)
- **fiveYearsProperty** : whether the owner has held the property for less than 5 years

The following columns are not described in the column information above:
- **district** : This column has the values [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]. I assume each number represents a particular district, so this column will be kept.
- **communityAverage** : From tracing the values in this column, I assume **communityAverage** is the average population of an area. Rows with the same **Lat** and **Lng** (the same location) have the same **communityAverage** value.
- **subway** : This column has the values [0, 1]. I assume 1 means *have* and 0 means *no*.


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('../input/lianjia/new.csv', encoding= 'unicode_escape')

In [None]:
df

In [None]:
df.drop('price',inplace=True,axis=1)

In [None]:
df.head()

In [None]:
DataDesc = []
for i in df.columns:
    DataDesc.append([
        i,
        df[i].dtypes,
        df[i].isna().sum(),
        round(df[i].isna().sum()/len(df)*100,2),
        df[i].nunique(),
        df[i].sample(2).values
    ])
df_dsc= pd.DataFrame(DataDesc, columns=['dataFeatures','dataType','null','nullPct','unique','uniqueSample'])
df_dsc.sort_values(by='null',ascending=False).reset_index(drop=True)
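
The summary cell above is repeated several times in this notebook. As a minimal sketch, it could be wrapped in a helper so the percentage is always computed from the frame being summarized. The name `describe_columns` and the toy frame are assumptions for illustration, and the random `uniqueSample` column is left out to keep the output deterministic.

```python
import pandas as pd

def describe_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, null count, null percentage and unique count."""
    summary = pd.DataFrame({
        "dataType": frame.dtypes.astype(str),
        "null": frame.isna().sum(),
        "nullPct": (frame.isna().mean() * 100).round(2),
        "unique": frame.nunique(),
    })
    # Move the column names out of the index and sort by null count.
    return (summary.rename_axis("dataFeatures")
                   .reset_index()
                   .sort_values(by="null", ascending=False)
                   .reset_index(drop=True))

# Toy frame, not the Lianjia data.
toy = pd.DataFrame({"a": [1, None, 3], "b": ["x", "y", "y"]})
summary = describe_columns(toy)
print(summary)
```

Calling `describe_columns(df)`, `describe_columns(dfnew)` or `describe_columns(dfa)` would then replace each repeated cell.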

### I. Column Inspection I (Checking Related Columns and Column Data)

#### - district, Lng, Lat, Cid, and communityAverage Columns

These five columns are related: each district contains several Cid (community id) values, or areas, and each Cid has one average population value.
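
The claim that each Cid carries a single communityAverage value can be checked for every community at once rather than for one sampled location. The sketch below uses a small made-up frame (the numbers are not from the dataset) and counts the distinct values per Cid.

```python
import pandas as pd

# Toy frame standing in for the real data; values are invented.
toy = pd.DataFrame({
    "Cid": [101, 101, 102, 102, 103],
    "communityAverage": [56000.0, 56000.0, 71000.0, 71000.0, 48000.0],
})

# Number of distinct communityAverage values per Cid.
# A count of 1 (or 0 when all values are missing) means the value is constant.
per_cid = toy.groupby("Cid")["communityAverage"].nunique()
is_constant = bool((per_cid <= 1).all())

print(per_cid)
print(is_constant)
```

On the real frame the same two lines could be run with `df` in place of `toy`.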

In [None]:
## sample 1
df[(df['Lng'] == 116.475489) & (df['Lat']==40.019520)][['Lng','Lat','Cid','communityAverage','district']].head()

###### - Floor Column
The values in the **floor** column contain non-ASCII characters (originally Chinese) followed by a number. Because there is no definite information about what these characters mean, I assume they might be units and treat the number as a value in meters.
To make this column easy to analyze and model, I will drop the characters, keep only the numeric part, and convert the data type to integer.
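
The odd characters look like Chinese text that was decoded one byte at a time. As a hedged sketch, assuming the source file was GBK-encoded (the notebook does not confirm this), such strings can be mapped back by re-encoding them as Latin-1 and decoding as GBK. `recover_gbk` is a hypothetical helper name, and the two samples are strings that appear elsewhere in this notebook.

```python
def recover_gbk(text: str) -> str:
    """Undo a byte-per-character mis-decoding, assuming the bytes were GBK."""
    return text.encode("latin-1").decode("gbk")

samples = ["ÖÐ 14", "Î´Öª"]
recovered = [recover_gbk(s) for s in samples]
print(recovered)
```

Under that assumption the samples come back as `中 14` and `未知`, which read as "middle" and "unknown". That would suggest the number is a floor count rather than a height in meters, but the source does not confirm it.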

In [None]:
df['floor'].unique()

In [None]:
splt_1 = []
splt_2 = []
for dfloor in df['floor']:
    x = len(dfloor.split())
    if x == 1 : 
        splt_1.append(dfloor)
    else :
        splt_2.append(dfloor)

In [None]:
#splt1
splt_1

In [None]:
# splt2
splt_2

- During data cleaning I found an anomaly: the values in *splt_1* consist only of non-ASCII characters with no number, so I will replace them with NaN.

In [None]:
def floor_clean(col):
    x = len(col['floor'].split())
    if x == 1:
        return np.nan
    else :
        return col['floor'].split()[1]

In [None]:
df['floor'] = df[['floor']].apply(floor_clean,axis=1)

In [None]:
df['floor'] = df['floor'].astype(float)  

In [None]:
df['floor']

- The **integer** data type does not accept NaN values, so the column is temporarily stored as **float**.
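
An alternative to the temporary float column is pandas' nullable integer dtype. The sketch below, on toy strings in the style of the floor column, extracts the trailing number with a regular expression and stores it as `Int64`, which can hold missing values. This is an illustration of the idea, not the method used in this notebook.

```python
import pandas as pd

# Toy values in the style of the floor column.
raw = pd.Series(["ÖÐ 14", "¸ß 6", "Î´Öª", "µÍ 16"])

# Capture the trailing run of digits; rows without one become missing.
digits = raw.str.extract(r"(\d+)\s*$", expand=False)

# "Int64" (capital I) is the nullable integer dtype, so missing values
# can stay in an integer column.
floor_int = pd.to_numeric(digits).astype("Int64")
print(floor_int)
```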

##### - TradeTime Column

Change the data type of the tradeTime column from object to datetime.

In [None]:
df['tradeTime'] = pd.to_datetime(df['tradeTime'])

In [None]:
df['tradeTime']

#### - ConstructionTime Column
In the **constructionTime** column there are 3 invalid values, which will be replaced with *NaN*.

In [None]:
df['constructionTime'].unique()

In [None]:
def cl_ct(col):
    if col['constructionTime'] == 'Î´Öª':
        return np.nan
    elif col['constructionTime'] == '1':
        return np.nan
    elif col['constructionTime'] == '0':
        return np.nan
    else:
        return col['constructionTime']

In [None]:
df['constructionTime'] = df[['constructionTime']].apply(cl_ct,axis=1)

In [None]:
df['constructionTime'] 
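
If the column was read as strings, the valid years are still strings after `cl_ct`, so a later numeric comparison such as `== 2006` would not match them. A hedged alternative is to coerce the whole column to numbers in one step. The sketch uses toy values, and the 1900 cut-off is an arbitrary illustrative threshold, not something taken from the dataset.

```python
import pandas as pd

# Toy values in the style of the constructionTime column.
ct = pd.Series(["2005", "1998", "Î´Öª", "1", "0", "2012"])

# Non-numeric strings become NaN instead of raising an error.
years = pd.to_numeric(ct, errors="coerce")

# Treat implausibly small numbers as missing (assumed threshold).
years = years.where(years >= 1900)
print(years.tolist())
```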

##### - livingRoom, drawingRoom, bathRoom Columns
All of these columns were originally of object type. Because they contain numeric values that describe counts, their data type will be changed to numeric (integer / float).

##### - livingRoom

In [None]:
# Living Room
df['livingRoom'].unique()

In [None]:
len(df[df['livingRoom']=='#NAME?'])

In [None]:
def cl_lv(col):
    if col['livingRoom'] == '#NAME?':
        return np.nan
    else:
        return int(col['livingRoom'])

In [None]:
df['livingRoom'] = df[['livingRoom']].apply(cl_lv,axis=1)

In [None]:
# Living Room
df['livingRoom'].unique()

The invalid value '#NAME?' was found in 32 rows; these values will be converted to *NaN*.

##### - drawingRoom

In [None]:
df[df['drawingRoom']=='µ× 28']

In [None]:
df['drawingRoom'].unique()

In [None]:
def cl_dr(col):
    tmp = str(col['drawingRoom'])
    x = len(tmp.split())
    if x == 2 :
        return int(col['drawingRoom'].split()[1])
    else:
        return int(col['drawingRoom'])

In [None]:
df[['drawingRoom']].apply(cl_dr,axis=1).unique()

In [None]:
df['drawingRoom']= df[['drawingRoom']].apply(cl_dr,axis=1)

In [None]:
df['drawingRoom'] = df['drawingRoom'].astype(int)

Some values contain non-ASCII text, namely: 'ÖÐ 14', 'ÖÐ 15', 'ÖÐ 16', 'ÖÐ 6', '¸ß 14', '¶¥ 6', 'µÍ 6', 'µÍ 16', '¸ß 12', 'µÍ 15', '5', '¸ß 6', 'µ× 28', 'µ× 11', 'ÖÐ 24', 'µ× 20', 'ÖÐ 22'. I assume the text is Chinese, so I keep only the number and drop the text. In addition, some numbers are stored as strings; I convert them to integers for this column.

##### - bathRoom

In [None]:
df['bathRoom'].unique()

In [None]:
len(df[df['bathRoom'] == 'Î´Öª'])

In [None]:
def cl_bt(col): 
    if col['bathRoom'] == 'Î´Öª':
        return np.nan
    else:
        return int(col['bathRoom'])

In [None]:
df['bathRoom'] = df[['bathRoom']].apply(cl_bt,axis=1)

In [None]:
df['bathRoom'].unique()

- Found 2 invalid values in the form of the non-ASCII string 'Î´Öª'; I changed them to NaN.

For the three columns above, the data type has been changed from object to numeric (float / int). There are missing values in the livingRoom, bathRoom and floor columns because some entries consisted of non-ASCII text and were replaced with NaN.

In [None]:
df.info()

## II. Changing Numeric Categorical Values To Strings.

In [None]:
df_new = df.copy()

##### - buildingType Column

The values in the buildingType column, previously the numbers 1-4, will be changed to:
- 1: 'including tower',
- 2: 'bungalow',
- 3: 'combination of plate and tower',
- 4: 'plate'

In [None]:
df['buildingType'].unique()

In [None]:
btype = [0.5,0.333,0.125,0.25,0.429,0.048,0.375,0.667] 
idxbtype=[]
for data in btype:
    x = df[df['buildingType']==data].index
    y = list(x)
    for data2 in y:
        idxbtype.append(data2)

In [None]:
len(idxbtype)

In [None]:
df.loc[idxbtype, ['buildingStructure','drawingRoom']]  # idxbtype holds index labels, so select with .loc

In [None]:
df['buildingStructure'].unique()

In [None]:
df[df['buildingStructure']==0]['buildingType'].value_counts()

In this process, invalid buildingType values were found, namely (0.5, 0.333, 0.125, 0.25, 0.429, 0.048, 0.375, 0.667). The values in the buildingType column should be categories numbered 1-4:
- tower (1)
- bungalow (2)
- combination of plate and tower (3)
- plate (4)

I looked at the relationship between the invalid buildingType values and buildingStructure and found that buildingStructure is invalid in the same rows: most rows with an invalid buildingType have a buildingStructure value of 0, and 0 is not one of the documented buildingStructure categories:

- unknown (1)
- mixed (2)
- brick and wood (3)
- brick and concrete (4)
- steel (5)
- steel-concrete composite (6)

Therefore I will delete the rows whose buildingType and buildingStructure values are invalid.

In [None]:
## Delete data by index (one call instead of dropping row by row)
df.drop(index=idxbtype, inplace=True)

In [None]:
# check unique
df['buildingType'].unique()

In [None]:
df[df['buildingType']==0.333]

In [None]:
# buildingtype with a value of 0.333 is not detected??
df[df['buildingType'].isnull()].index

In [None]:
df['buildingType'] = df['buildingType'].map({
    1.0 : 'including tower',
    2.0 : 'bungalow',
    3.0 : 'combination of plate and tower',
    4.0 : 'plate'
})

In [None]:
df[df['buildingType'].isnull()]

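The rows with the value 0.333 were not matched by the equality filter, so they were not dropped; after the mapping they add 5 NaN values.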
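
One possible reason the `== 0.333` filter found nothing is that the stored values carry more digits than the displayed ones. This is an assumption and was not checked against the file. The toy sketch below shows how an exact comparison can miss such a value, how a tolerant comparison finds it, and how a whitelist of valid codes avoids listing the bad values at all.

```python
import numpy as np
import pandas as pd

# Toy values: two valid codes, two invalid fractions, one missing value.
bt = pd.Series([1.0, 4.0, 1 / 3, 0.5, np.nan])

exact = bt == 0.333                        # exact comparison misses 1/3
close = np.isclose(bt, 0.333, atol=1e-3)   # tolerant comparison finds it

# Whitelist: keep only the documented codes 1-4.
valid = bt.isin([1, 2, 3, 4])

print(int(exact.sum()), int(close.sum()), int(valid.sum()))
```

With the whitelist, `df = df[df['buildingType'].isin([1, 2, 3, 4])]` would also remove rows whose buildingType is already NaN, so it is not a drop-in replacement for the cells above.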

##### - buildingStructure Column

The values in the buildingStructure column, previously the numbers 1-6, will be changed as follows:
- 1: 'including unknow',
- 2: 'mixed',
- 3: 'brick and wood',
- 4: 'brick and concrete',
- 5: 'steel',
- 6: 'steel-concrete composite'

In [None]:
index_bs = []
for idx in df[df['buildingStructure']==0].index:
    index_bs.append(idx)

In [None]:
index_bs

In [None]:
## Delete data by index (the labels listed in index_bs)
df.drop([92251, 92267, 92304, 92356], inplace=True)

In [None]:
df['buildingStructure'].unique()

In [None]:
df['buildingStructure'] = df['buildingStructure'].map({
    1: 'including unknow',
    2: 'mixed',
    3: 'brick and wood',
    4: 'brick and concrete',
    5: 'steel',
    6: 'steel-concrete composite'
})

##### - Renovation Condition

The values in the renovationCondition column, previously the numbers 1-4, will be changed to:
- 1:'including other',
- 2:'rough',
- 3:'Simplicity',
- 4:'hardcover'

In [None]:
df['renovationCondition'].unique()

In [None]:
df['renovationCondition'] = df['renovationCondition'].map({
    1:'including other',
    2:'rough',
    3:'Simplicity',
    4:'hardcover'
})

##### -  District Column
Because there is no clear information about the districts, I will label them with letters of the alphabet.
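
As a small sketch, the number-to-letter mapping can also be generated instead of typed out. The name `district_map` and the toy series are assumptions for illustration.

```python
import string
import pandas as pd

# 1 -> 'A', 2 -> 'B', ..., 13 -> 'M'
district_map = {i + 1: letter
                for i, letter in enumerate(string.ascii_uppercase[:13])}

toy = pd.Series([1, 7, 13])
labels = toy.map(district_map).tolist()
print(labels)
```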

In [None]:
df['district'].unique()

In [None]:
df['district'] = df['district'].map({
    1:'A',
    2:'B',
    3:'C',
    4:'D',
    5:'E',
    6:'F',
    7:'G',
    8:'H',
    9:'I',
    10:'J',
    11:'K',
    12:'L',
    13: 'M',  
})

##### - Elevator Column
- Based on the column information, in the **elevator** column 1 means *have* and 0 means *not have*.

In [None]:
df['elevator'] = df['elevator'].map({
    0:'not have',
    1:'have'
})

##### - subway column
- There is no specific information; my personal assumption is that 1 means *yes* and 0 means *no*.

In [None]:
df['subway'] = df['subway'].map({
    0:'no',
    1:'yes'
})

##### - fiveYearsProperty Column
- Based on the column information, fiveYearsProperty indicates whether the owner has held the property for less than 5 years, so the value 0 is *no* and 1 is *yes*.

In [None]:
df['fiveYearsProperty'] = df['fiveYearsProperty'].map({
    0:'no',
    1:'yes'
})

In [None]:
dfa = df.copy()

## III. Fill in the Missing Value

In [None]:
DataDesc = []
for i in df.columns:
    DataDesc.append([
        i,
        df[i].dtypes,
        df[i].isna().sum(),
        round(df[i].isna().sum()/len(df)*100,2),
        df[i].nunique(),
        df[i].sample(2).values
    ])
df_dsc= pd.DataFrame(DataDesc, columns=['dataFeatures','dataType','null','nullPct','unique','uniqueSample'])
df_dsc.sort_values(by='null',ascending=False).reset_index(drop=True)

In [None]:
# df_corr.select_dtypes(include='object')

In [None]:
# plt.figure(figsize=(20,15))
# sns.heatmap(df.corr(),annot=True)

##### - constructionTime column

To fill in the missing values in the constructionTime column, I use the buildingType and renovationCondition columns as parameters. My assumptions:
- The type of a building is usually influenced by when it was built, for example classic styles in the 1990s and modern styles in the 2000s.
- The renovation condition also matters: the older the building, the higher the level of renovation.

- **Start**

In [None]:
df[df['constructionTime'].isnull()][['constructionTime','buildingType','renovationCondition']]

In [None]:
#1 sample
df[(df['buildingType']=='plate') & (df['renovationCondition']=='including other')]['constructionTime'].value_counts()

In [None]:
#2 
df[(df['buildingType']=='plate') & (df['renovationCondition']=='hardcover')]['constructionTime'].value_counts()

In [None]:
#3 
df[(df['buildingType']=='plate') & (df['renovationCondition']=='Simplicity')]['constructionTime'].value_counts()

In [None]:
#4 dist
df[(df['buildingType']=='including tower') & (df['renovationCondition']=='Simplicity')]['constructionTime'].value_counts()

In [None]:
def fill_missval(col):
    if pd.isna(col['constructionTime']): 
        if (col['buildingType']=='plate') & (col['renovationCondition']=='including other'):
            return 2004
        elif (col['buildingType']=='plate') & (col['renovationCondition']=='hardcover'):
            return 2003
        elif (col['buildingType']=='plate') & (col['renovationCondition']=='Simplicity'):
            return 1995
        elif (col['buildingType']=='including tower') & (col['renovationCondition']=='Simplicity'):
            return 2000
        else:
            return col['constructionTime']
    else:
        return col['constructionTime']

In [None]:
df['constructionTime'] = df[['buildingType','renovationCondition','buildingStructure','constructionTime']].apply(fill_missval,axis=1)

- **Checkpoint 1**

In [None]:
df[df['constructionTime'].isnull()][['constructionTime','buildingType','renovationCondition']]

In [None]:
#5 
df[(df['buildingType']=='including tower') & (df['renovationCondition']=='hardcover')]['constructionTime'].value_counts()

In [None]:
#6 
df[(df['buildingType']=='plate') & (df['renovationCondition']=='rough')]['constructionTime'].value_counts()

In [None]:
#7 
df[(df['buildingType']=='combination of plate and tower') & (df['renovationCondition']=='hardcover')]['constructionTime'].value_counts()

In [None]:
#8 
df[(df['buildingType']=='combination of plate and tower') & (df['renovationCondition']=='including other')]['constructionTime'].value_counts()

In [None]:
#9 
df[(df['buildingType']=='including tower') & (df['renovationCondition']=='including other')]['constructionTime'].value_counts()

In [None]:
def fill_missval2(col):
    if pd.isna(col['constructionTime']):
        if (col['buildingType']=='including tower') & (col['renovationCondition']=='hardcover'):
            return 2004
        elif (col['buildingType']=='plate') & (col['renovationCondition']=='rough'):
            return 2012
        elif (col['buildingType']=='combination of plate and tower') & (col['renovationCondition']=='hardcover'):
            return 2007
        elif (col['buildingType']=='combination of plate and tower') & (col['renovationCondition']=='including other'):
            return 2005
        elif (col['buildingType']=='including tower') & (col['renovationCondition']=='including other'):
            return 2004
        else:
            return col['constructionTime']
    else:
        return col['constructionTime']

In [None]:
df['constructionTime'] = df[['buildingType','renovationCondition','buildingStructure','constructionTime']].apply(fill_missval2,axis=1)

- **Checkpoint 2**

In [None]:
df[df['constructionTime'].isnull()][['constructionTime','buildingType','renovationCondition']]

In [None]:
#9 
df[(df['buildingType']=='combination of plate and tower') & (df['renovationCondition']=='Simplicity')]['constructionTime'].value_counts()

In [None]:
def fill_missval3(col):
    if pd.isna(col['constructionTime']):
        if (col['buildingType']=='combination of plate and tower') & (col['renovationCondition']=='Simplicity'):
            return 2006
        else:
            return col['constructionTime']
    else:
        return col['constructionTime']

In [None]:
df['constructionTime'] = df[['buildingType','renovationCondition','buildingStructure','constructionTime']].apply(fill_missval3,axis=1)

- **Checkpoint 3**

In [None]:
tmpdf = df[df['constructionTime'].isnull()][['constructionTime','buildingType','renovationCondition']].copy()

In [None]:
tmpdf.dropna(subset=['buildingType']).drop_duplicates()

In [None]:
#10
df[(df['buildingType']=='combination of plate and tower') & (df['renovationCondition']=='rough')]['constructionTime'].value_counts()

In [None]:
#11
df[(df['buildingType']=='bungalow') & (df['renovationCondition']=='rough')]['constructionTime'].value_counts()

In [None]:
#12
df[(df['buildingType']=='bungalow') & (df['renovationCondition']=='including other')]['constructionTime'].value_counts()

In [None]:
#13
df[(df['buildingType']=='including tower') & (df['renovationCondition']=='rough')]['constructionTime'].value_counts()

In [None]:
#14
df[(df['buildingType']=='bungalow') & (df['renovationCondition']=='Simplicity')]['constructionTime'].value_counts()

In [None]:
#15
df[(df['buildingType']=='bungalow') & (df['renovationCondition']=='hardcover')]['constructionTime'].value_counts()

In [None]:
def fill_missval4(col):
    if pd.isna(col['constructionTime']):
        if (col['buildingType']=='combination of plate and tower') & (col['renovationCondition']=='rough'):
            return 2012
        elif (col['buildingType']=='bungalow') & (col['renovationCondition']=='rough'):
            return 1970
        elif (col['buildingType']=='bungalow') & (col['renovationCondition']=='including other'):
            return 1980
        elif (col['buildingType']=='including tower') & (col['renovationCondition']=='rough'):
            return 2012
        elif (col['buildingType']=='bungalow') & (col['renovationCondition']=='Simplicity'):
            return 1988
        elif (col['buildingType']=='bungalow') & (col['renovationCondition']=='hardcover'):
            return 2010
        else:
            return col['constructionTime']
    else:
        return col['constructionTime']

In [None]:
df['constructionTime'] = df[['buildingType','renovationCondition','buildingStructure','constructionTime']].apply(fill_missval4,axis=1)

- **Checkpoint 4**

In [None]:
tmpdf2 = df[df['constructionTime'].isnull()][['constructionTime','buildingType','renovationCondition']]

In [None]:
tmpdf2.drop_duplicates()

In [None]:
#16
df[df['renovationCondition']=='rough']['constructionTime'].value_counts()

In [None]:
#17
df[df['renovationCondition']=='including other']['constructionTime'].value_counts()

In [None]:
#18
df[df['renovationCondition']=='hardcover']['constructionTime'].value_counts()

In [None]:
#19
df[df['renovationCondition']=='Simplicity']['constructionTime'].value_counts()

In [None]:
def fill_missval5(col):
    if pd.isna(col['constructionTime']):
        if col['renovationCondition']=='rough':
            return 2012
        elif col['renovationCondition']=='including other':
            return 2004
        elif col['renovationCondition']=='hardcover':
            return 2003
        elif col['renovationCondition']=='Simplicity':
            return 1995
        else:
            return col['constructionTime']
    else:
        return col['constructionTime']

In [None]:
df['constructionTime'] = df[['buildingType','renovationCondition','buildingStructure','constructionTime']].apply(fill_missval5,axis=1)

- Done
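
The checkpoints above look up the most frequent constructionTime for each combination of buildingType and renovationCondition and hard-code the result. The same idea can be sketched with a grouped transform. The toy frame and the helper name `most_common` are assumptions, and this is an illustration of the technique rather than a reproduction of the values used above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "buildingType": ["plate", "plate", "plate", "plate", "bungalow", "bungalow"],
    "renovationCondition": ["rough", "rough", "rough", "rough", "hardcover", "hardcover"],
    "constructionTime": [2012, 2012, 2003, np.nan, 2010, np.nan],
})

def most_common(s: pd.Series):
    """Most frequent non-null value, or NaN if the group has none."""
    modes = s.mode()
    return modes.iloc[0] if not modes.empty else np.nan

# Most frequent year per (buildingType, renovationCondition), aligned to rows.
group_mode = (
    toy.groupby(["buildingType", "renovationCondition"])["constructionTime"]
       .transform(most_common)
)
toy["constructionTime"] = toy["constructionTime"].fillna(group_mode)
print(toy)
```

Rows whose group has no known year stay NaN, which corresponds to the later fallback on renovationCondition alone.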

In [None]:
DataDesc = []
for i in df.columns:
    DataDesc.append([
        i,
        df[i].dtypes,
        df[i].isna().sum(),
        round(df[i].isna().sum()/len(df)*100,2),
        df[i].nunique(),
        df[i].sample(2).values
    ])
df_dsc= pd.DataFrame(DataDesc, columns=['dataFeatures','dataType','null','nullPct','unique','uniqueSample'])
df_dsc.sort_values(by='null',ascending=False).reset_index(drop=True)

#### - subway, fiveYearsProperty,  elevator , livingRoom,  and floor columns

In these five columns, the missing values occur at the same index. Because there is only one such row, I decided to delete it.

In [None]:
df[df['subway'].isnull()][['subway','fiveYearsProperty','elevator','livingRoom','floor']]

In [None]:
df.drop(244054,inplace=True)

In [None]:
DataDesc = []
for i in df.columns:
    DataDesc.append([
        i,
        df[i].dtypes,
        df[i].isna().sum(),
        round(df[i].isna().sum()/len(df)*100,2),
        df[i].nunique(),
        df[i].sample(2).values
    ])
df_dsc= pd.DataFrame(DataDesc, columns=['dataFeatures','dataType','null','nullPct','unique','uniqueSample'])
df_dsc.sort_values(by='null',ascending=False).reset_index(drop=True)

In [None]:
dfnew = df.copy()

###### - communityAverage column

To fill in the missing values in the **communityAverage** column, I rely on a pattern: rows with the same **Cid** have the same **communityAverage** value. After checking, however, 67.2% of the 463 rows with a missing **communityAverage** belong to a **Cid** that has no **communityAverage** value at all. I also looked at the other columns that contain nulls, **buildingType** and **DOM**: among the 463 rows, 64% (the two columns combined) are also missing **buildingType** or **DOM**. I therefore decided to delete all rows with a missing **communityAverage**. Removing 0.15% of the total data will not have much effect on the results of the analysis.
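
The Cid pattern described above can be sketched as a grouped fill: take the first known communityAverage of each Cid and use it for the missing rows of that Cid. The toy frame is invented for illustration. Communities with no known value at all stay missing, which is the situation the paragraph describes.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Cid": [101, 101, 102, 102, 103],
    "communityAverage": [56000.0, np.nan, np.nan, np.nan, 48000.0],
})

# First non-null communityAverage per Cid, broadcast back to every row.
per_cid = toy.groupby("Cid")["communityAverage"].transform("first")
toy["communityAverage"] = toy["communityAverage"].fillna(per_cid)

# Rows whose community has no known value remain NaN.
still_missing = int(toy["communityAverage"].isna().sum())
print(toy)
print(still_missing)
```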

In [None]:
### sample
dfnew[(dfnew['Lng']==116.475489) & (dfnew['Lat']==40.019520)][['Cid','communityAverage']]

In [None]:
dfcidnull = dfnew[dfnew['communityAverage'].isnull()][['Cid','communityAverage']].copy()
dfcidnull

In [None]:
# Store the index of the rows with a missing communityAverage
idxca = dfnew[dfnew['communityAverage'].isnull()][['Cid']].index

In [None]:
# Store the Cid values of those rows (single brackets give a 1-D array)
val = dfnew[dfnew['communityAverage'].isnull()]['Cid'].values

In [None]:
tmpval = []
for p in val:
    tmpval.append(int(p))

In [None]:
tmpcid = []
tmpval2 =[]
for ca in tmpval:
    # communityAverage of the first row with this Cid (may itself be NaN)
    tmpval2.append(float(dfnew[dfnew['Cid']==ca]['communityAverage'].iloc[0]))
    tmpcid.append(ca)
    

In [None]:
### values 
dfcid = pd.DataFrame({
    'Cid': tmpcid,
    'communityAverage': tmpval2
})

In [None]:
dfcid

In [None]:
## Cid column information that has a value
len(dfcid.dropna(subset=['communityAverage']))

In [None]:
# Cid column information that does not have a communityAverage value
len(dfcid[dfcid['communityAverage'].isnull()])

In [None]:
# relationship with the same column has nan values
dfbd =  dfnew.loc[idxca, ['buildingType','DOM']]  # idxca holds index labels, so use .loc
dfbd

In [None]:
# Missing values from the DOM and buildingtype columns
len(dfbd[(dfbd['DOM'].isnull()) | (dfbd['buildingType'].isnull())])

In [None]:
# Deletes data based on the 'Community Average' column which has nan values
dfnew.dropna(subset=['communityAverage'],inplace=True)

In [None]:
DataDesc = []
for i in dfnew.columns:
    DataDesc.append([
        i,
        dfnew[i].dtypes,
        dfnew[i].isna().sum(),
        round(dfnew[i].isna().sum()/len(dfnew)*100,2),
        dfnew[i].nunique(),
        dfnew[i].sample(2).values
    ])
df_dsc= pd.DataFrame(DataDesc, columns=['dataFeatures','dataType','null','nullPct','unique','uniqueSample'])
df_dsc.sort_values(by='null',ascending=False).reset_index(drop=True)

In [None]:
dfa = dfnew.copy()

##### - buildingType column

To fill in the missing values in the buildingType column, I considered using the pattern in the constructionTime and buildingStructure columns. After tracing, however, some combinations of **constructionTime** and **buildingStructure** have no buildingType value at all. In addition, the rows at risk amount to only 0.63% of the total data, so deleting them will not really affect the analysis results. I therefore decided to delete them.

In [None]:
dfa[dfa['buildingType'].isnull()][['buildingType','constructionTime','buildingStructure']]

In [None]:
dfa[dfa['buildingType'].isnull()][['buildingType','constructionTime','buildingStructure']].drop_duplicates()

In [None]:
#sample 1
dfa[(dfa['constructionTime']==2006) & (dfa['buildingStructure']=='steel-concrete composite')][['buildingType']].value_counts()

In [None]:
# sample 2
dfa[(dfa['constructionTime']==2004) & (dfa['buildingStructure']=='steel-concrete composite')][['buildingType']].value_counts()

In [None]:
#sample
dfa[(dfa['constructionTime']==2003) & (dfa['buildingStructure']=='brick and concrete')][['buildingType']].value_counts()

In [None]:
p = dfa[dfa['buildingType'].isnull()][['buildingType','constructionTime','buildingStructure']].drop_duplicates()
p

In [None]:
tmpct=[]
for ct in p['constructionTime'].values:
    tmpct.append(ct)

In [None]:
tmpbs=[]
for bs in p['buildingStructure'].values:
    tmpbs.append(bs)

In [None]:
p = pd.DataFrame({
    'constructionTime' : tmpct,
    'buildingStructure' : tmpbs
})
p

In [None]:
bdtype = []
for dt in range(len(tmpbs)):
    bdtype.append(dfa[(dfa['constructionTime']==tmpct[dt]) & (dfa['buildingStructure']==tmpbs[dt])][['buildingType']].value_counts()[:1])

In [None]:
bdtype2 = []
for dt2 in bdtype:
    bdtype2.append(dt2.index)

In [None]:
bdtype2

In [None]:
pd.DataFrame({
    'constructionTime' : tmpct,
    'buildingStructure' : tmpbs,
    'buildingType' : bdtype
})

In [None]:
# delete the rows with a missing buildingType
dfa.dropna(subset=['buildingType'],inplace=True)

###### - DOM Column

For the missing values in the **DOM** column, based on my search for related columns and my personal assumptions, my initial plan is to fill them with the mean **DOM** of each **district**.

In [None]:
DataDesc = []
for i in dfa.columns:
    DataDesc.append([
        i,
        dfa[i].dtypes,
        dfa[i].isna().sum(),
        round(dfa[i].isna().sum()/len(dfa)*100,2),
        dfa[i].nunique(),
        dfa[i].sample(2).values
    ])
df_dsc= pd.DataFrame(DataDesc, columns=['dataFeatures','dataType','null','nullPct','unique','uniqueSample'])
df_dsc.sort_values(by='null',ascending=False).reset_index(drop=True)

In [None]:
dfa['district'].unique()

In [None]:
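for ij in dfa['district'].unique():
    # histplot replaces the deprecated distplot; kde=True keeps the density curve
    sns.histplot(dfa[dfa['district']==ij]['DOM'].dropna(), kde=True, stat='density', label=ij)
    plt.show()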

From the plots, the DOM distribution within each district is not normal, so I will fill in the missing values with the median instead.

In [None]:
median = []
district = []
for i in dfa['district'].unique():
    median.append(dfa[dfa['district']==i]['DOM'].median())
    district.append(i)
pd.DataFrame({
    'district':district,
    'median':median
})

In [None]:
def fill_dom(col):
    if pd.isna(col['DOM']):
        if col['district']=='G':
            return 8
        elif col['district']=='F':
            return 9
        elif col['district']=='A':
            return 5
        elif col['district']=='M':
            return 2
        elif col['district']=='J':
            return 5
        elif col['district']=='B':
            return 4
        elif col['district']=='H':
            return 6
        elif col['district']=='D':
            return 6
        elif col['district']=='E':
            return 18
        elif col['district']=='C':
            return 9
        elif col['district']=='I':
            return 5
        elif col['district']=='L':
            return 1
        elif col['district']=='K':
            return 1
    else:
        return col['DOM']

In [None]:
dfa['DOM'] = dfa[['district','DOM']].apply(fill_dom,axis=1)
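
The medians in `fill_dom` are typed in by hand from the table above. As a hedged sketch on toy data, the same fill can be computed directly so the numbers cannot drift from the data; the frame below is invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "district": ["A", "A", "A", "B", "B", "B"],
    "DOM": [1.0, 5.0, np.nan, 2.0, np.nan, 40.0],
})

# Median DOM of each district, aligned to the original rows.
district_median = toy.groupby("district")["DOM"].transform("median")
toy["DOM"] = toy["DOM"].fillna(district_median)
print(toy)
```

On the real frame this would be `dfa['DOM'].fillna(dfa.groupby('district')['DOM'].transform('median'))`.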

In [None]:
DataDesc = []
for i in dfa.columns:
    DataDesc.append([
        i,
        dfa[i].dtypes,
        dfa[i].isna().sum(),
        round(dfa[i].isna().sum()/len(dfa)*100,2),
        dfa[i].nunique(),
        dfa[i].sample(2).values
    ])
df_dsc= pd.DataFrame(DataDesc, columns=['dataFeatures','dataType','null','nullPct','unique','uniqueSample'])
df_dsc.sort_values(by='null',ascending=False).reset_index(drop=True)

### IV. Column Inspection II (Checking Column Data Types)
- Because of the missing data, the data types of some columns could not be changed earlier, so I check them again here.

In [None]:
dfa['floor'] = dfa['floor'].astype(int)
dfa['bathRoom'] = dfa['bathRoom'].astype(int)
dfa['livingRoom'] = dfa['livingRoom'].astype(int)
dfa['DOM'] = dfa['DOM'].astype(int)

The columns above hold whole-number values (counts and numbers of days), so I changed their data type to integer.
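
`astype(int)` raises an error if a column still contains NaN, so a guarded cast can be useful while cleaning is in progress. The sketch below uses a toy frame and simply skips columns that still have missing values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"floor": [14.0, 6.0, np.nan], "DOM": [1.0, 5.0, 9.0]})

for name in ["floor", "DOM"]:
    if toy[name].isna().any():
        # astype(int) would raise here, so leave the column as float.
        continue
    toy[name] = toy[name].astype(int)

print(toy.dtypes)
```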

In [None]:
DataDesc = []
for i in dfa.columns:
    DataDesc.append([
        i,
        dfa[i].dtypes,
        dfa[i].isna().sum(),
        round(dfa[i].isna().sum()/len(dfa)*100,2),
        dfa[i].nunique(),
        dfa[i].sample(2).values
    ])
df_dsc= pd.DataFrame(DataDesc, columns=['dataFeatures','dataType','null','nullPct','unique','uniqueSample'])
df_dsc.sort_values(by='null',ascending=False).reset_index(drop=True)

In [None]:
dfa.to_csv('clean_data')

In [None]:
dfa.info()