<h2>Cleaning Data</h2>

<ul>
    <li>Identify missing data in the data frame</li>
    <li>Treat (delete or impute) missing values</li>
    <li>In pandas, missing data is represented using either of two objects: NaN (Not a Number) or None.</li>
    <li>We'll not get into the differences between them and how Python stores them internally etc. </li>
    <li>We'll focus on studying ways to identify and treat missing values in Pandas dataframes.</li>  
    
</ul>
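<p>As a minimal sketch of how the two missing-value markers behave, the toy series below (hypothetical data, not part of the Melbourne dataset) shows that pandas reports both NaN and None as missing.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy series mixing both missing-value markers
s = pd.Series([1.0, np.nan, None, 4.0])

# In a numeric Series, None is treated as a missing value just like NaN
missing_num = s.isnull().tolist()
print(missing_num)   # [False, True, True, False]

# In a Series of strings, both markers are still reported as missing
s_obj = pd.Series(["a", None, np.nan])
missing_obj = s_obj.isnull().tolist()
print(missing_obj)   # [False, True, True]
```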

<p> There are four main methods to identify and treat missing data: </p>
<ul>
    <li>isnull(): Flags missing values; returns a boolean object of the same shape as the input</li>
    <li>notnull(): Opposite of isnull(); returns a boolean object of the same shape</li>
    <li>dropna(): Drops the rows (or columns) containing missing values and returns the rest</li>
    <li>fillna(): Fills (or imputes) the missing values with a specified value</li>
</ul>
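<p>The four methods listed above can be sketched on a small, made-up data frame. The column names <code>rooms</code> and <code>price</code> and the fill value 0 are assumptions chosen only for illustration.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data frame with two missing values
df = pd.DataFrame({"rooms": [2, 3, np.nan], "price": [100.0, np.nan, 300.0]})

mask_missing = df.isnull()    # True where a value is missing
mask_present = df.notnull()   # True where a value is present
df_dropped = df.dropna()      # keeps only the rows with no missing value
df_filled = df.fillna(0)      # replaces every missing value with 0

print(mask_missing)
print(df_dropped)
print(df_filled)
```

<p>Note that dropna() and fillna() return new data frames here; the original <code>df</code> is left unchanged.</p>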

<h2> Identifying Missing Data </h2>
    <p> The methods isnull() and notnull() are the most common ways of identifying missing values.</p>
    <p> While handling missing data, you first need to identify the rows and columns containing missing values, count the number of missing values, and then decide how you want to treat them.</p>
    <p>It is important that you treat missing values in each column separately, rather than implementing a single solution (e.g. replacing NaNs by the mean of a column) for all columns.</p>
    <p>isnull() returns a boolean data frame (True where a value is missing, False otherwise), which can then be used to find the rows or columns containing missing values.</p>
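<p>A minimal sketch of how the boolean result of isnull() can be reduced along each axis, using a hypothetical three-column data frame: any() and all() answer "is anything missing?" and "is everything missing?", while sum() counts the missing values.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "b" is entirely missing, column "c" is complete
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, np.nan],
    "c": [1.0, 2.0, 3.0],
})

missing = df.isnull()

cols_any = missing.any(axis=0)   # columns with at least one missing value
cols_all = missing.all(axis=0)   # columns where every value is missing
per_col = missing.sum(axis=0)    # number of missing values in each column
per_row = missing.sum(axis=1)    # number of missing values in each row

print(cols_any)
print(cols_all)
print(per_col)
print(per_row)
```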

In [303]:
import numpy as np
import pandas as pd

dataframe_melbourne = pd.read_csv('D:/upgrad/melbourne.csv')
dataframe_melbourne.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,68 Studley St,2,h,,SS,Jellis,03-09-2016,2.5,3067.0,...,1.0,1.0,126.0,,,Yarra,-37.8014,144.9958,Northern Metropolitan,4019.0
1,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,03-12-2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
2,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,04-02-2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
3,Abbotsford,18/659 Victoria St,3,u,,VB,Rounds,04-02-2016,2.5,3067.0,...,2.0,1.0,0.0,,,Yarra,-37.8114,145.0116,Northern Metropolitan,4019.0
4,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,04-03-2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0


In [304]:
# Shape of the dataframe
print(dataframe_melbourne.shape)
# Info of the dataframe: info is a method, so it must be called with parentheses
dataframe_melbourne.info()


(23547, 21)

In [305]:
#Identifying missing values
dataframe_melbourne.isnull()


Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,False,False,False,False,True,False,False,False,False,False,...,False,False,False,True,True,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,True,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,True,False,False,False,False,False,...,False,False,False,True,True,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,True,False,False,False,False,False
6,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7,False,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
8,False,False,False,False,True,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False
9,False,False,False,False,True,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False


In [306]:
#identifying missing values as a sum

dataframe_melbourne.isnull().sum()

Suburb               0
Address              0
Rooms                0
Type                 0
Price             5151
Method               0
SellerG              0
Date                 0
Distance             1
Postcode             1
Bedroom2          4481
Bathroom          4484
Car               4626
Landsize          6137
BuildingArea     13529
YearBuilt        12007
CouncilArea       7891
Lattitude         4304
Longtitude        4304
Regionname           1
Propertycount        1
dtype: int64

In [308]:
# Columns having at least one missing value
dataframe_melbourne.isnull().any()

# Columns in which every value is missing (axis=0 checks down each column);
# only this last expression is displayed, which is the output shown below
dataframe_melbourne.isnull().all(axis=0)

Suburb           False
Address          False
Rooms            False
Type             False
Price            False
Method           False
SellerG          False
Date             False
Distance         False
Postcode         False
Bedroom2         False
Bathroom         False
Car              False
Landsize         False
BuildingArea     False
YearBuilt        False
CouncilArea      False
Lattitude        False
Longtitude       False
Regionname       False
Propertycount    False
dtype: bool

In [24]:
# Check whether all the values in a row are missing

dataframe_melbourne.isnull().all(axis=1)

0        False
1        False
2        False
3        False
4        False
5        False
6        False
7        False
8        False
9        False
10       False
11       False
12       False
13       False
14       False
15       False
16       False
17       False
18       False
19       False
20       False
21       False
22       False
23       False
24       False
25       False
26       False
27       False
28       False
29       False
         ...  
23517    False
23518    False
23519    False
23520    False
23521    False
23522    False
23523    False
23524    False
23525    False
23526    False
23527    False
23528    False
23529    False
23530    False
23531    False
23532    False
23533    False
23534    False
23535    False
23536    False
23537    False
23538    False
23539    False
23540    False
23541    False
23542    False
23543    False
23544    False
23545    False
23546    False
Length: 23547, dtype: bool

In [25]:
dataframe_melbourne.isnull().all(axis=1).sum()

0

In [27]:
dataframe_melbourne.isnull().sum(axis=1)

0        3
1        2
2        0
3        3
4        0
5        2
6        0
7        1
8        2
9        2
10       2
11       0
12       1
13       1
14       0
15       9
16       9
17       2
18       0
19       9
20       1
21       9
22       9
23       2
24       0
25       0
26       7
27       9
28       2
29       2
        ..
23517    2
23518    3
23519    3
23520    2
23521    1
23522    4
23523    3
23524    3
23525    1
23526    2
23527    1
23528    5
23529    4
23530    5
23531    9
23532    1
23533    5
23534    2
23535    3
23536    4
23537    2
23538    1
23539    2
23540    2
23541    1
23542    2
23543    8
23544    4
23545    1
23546    2
Length: 23547, dtype: int64

In [59]:
import pandas as pd
df_test = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
print(df_test.isnull().sum())

Ord_id                   0
Prod_id                  0
Ship_id                  0
Cust_id                  0
Sales                   20
Discount                55
Order_Quantity          55
Profit                  55
Shipping_Cost           55
Product_Base_Margin    109
dtype: int64


<h2> Treating Missing Values </h2>

<p>There are broadly two ways to treat missing values:</p>

<ol>
    <li>Delete: Drop the rows or columns that contain missing values</li>
    <li>Impute:
        <ul>
            <li>Imputing by a simple statistic: Replace the missing values with another value, commonly the mean, median, or mode.</li>
            <li>Predictive techniques: Use statistical models such as k-NN or SVM to predict and impute missing values.</li>
        </ul>
    </li>
</ol>
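<p>The two treatments above can be sketched as follows. The data, the column names <code>landsize</code> and <code>type</code>, and the choice of median and mode are assumptions made for illustration; the right statistic depends on each column.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric and one categorical column
df = pd.DataFrame({
    "landsize": [120.0, np.nan, 400.0, 200.0],
    "type": ["h", "u", None, "h"],
})

# Delete: drop every row that has at least one missing value
deleted = df.dropna()

# Impute: fill each column with its own simple statistic
imputed = df.copy()
imputed["landsize"] = imputed["landsize"].fillna(imputed["landsize"].median())
imputed["type"] = imputed["type"].fillna(imputed["type"].mode()[0])

print(deleted)
print(imputed)
```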

In [47]:
# Percentage of missing values in each column
round(100*(dataframe_melbourne.isnull().sum()/len(dataframe_melbourne.index)), 2)

Suburb            0.00
Address           0.00
Rooms             0.00
Type              0.00
Price            21.88
Method            0.00
SellerG           0.00
Date              0.00
Distance          0.00
Postcode          0.00
Bedroom2         19.03
Bathroom         19.04
Car              19.65
Landsize         26.06
BuildingArea     57.46
YearBuilt        50.99
CouncilArea      33.51
Lattitude        18.28
Longtitude       18.28
Regionname        0.00
Propertycount     0.00
dtype: float64
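<p>The percentage calculation above divides the count of missing values by the number of rows. Since the mean of a boolean column is the fraction of True values, a shorter form is possible; the sketch below compares both on hypothetical data.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: half of the prices are missing
df = pd.DataFrame({"price": [1.0, np.nan, 3.0, np.nan], "rooms": [2, 3, 4, 5]})

# Form used in this notebook: count of missing values divided by number of rows
pct_long = round(100 * (df.isnull().sum() / len(df.index)), 2)

# Shorter alternative: the mean of a boolean column is the fraction of True values
pct_short = (100 * df.isnull().mean()).round(2)

print(pct_long)
print(pct_short)
```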

In [48]:
# BuildingArea, YearBuilt and CouncilArea have the highest share of missing values, so we drop these columns.
dataframe_melbourne = dataframe_melbourne.drop(["BuildingArea", "YearBuilt", "CouncilArea"], axis=1)
round(100*(dataframe_melbourne.isnull().sum()/len(dataframe_melbourne.index)), 2)

Suburb            0.00
Address           0.00
Rooms             0.00
Type              0.00
Price            21.88
Method            0.00
SellerG           0.00
Date              0.00
Distance          0.00
Postcode          0.00
Bedroom2         19.03
Bathroom         19.04
Car              19.65
Landsize         26.06
Lattitude        18.28
Longtitude       18.28
Regionname        0.00
Propertycount     0.00
dtype: float64

In [92]:
import numpy as np
import pandas as pd

data_frame_m =pd.read_csv("D:/upgrad/melbourne.csv")
round(100*(data_frame_m.isnull().sum()/len(data_frame_m.index)), 2)



Suburb            0.00
Address           0.00
Rooms             0.00
Type              0.00
Price            21.88
Method            0.00
SellerG           0.00
Date              0.00
Distance          0.00
Postcode          0.00
Bedroom2         19.03
Bathroom         19.04
Car              19.65
Landsize         26.06
BuildingArea     57.46
YearBuilt        50.99
CouncilArea      33.51
Lattitude        18.28
Longtitude       18.28
Regionname        0.00
Propertycount     0.00
dtype: float64

In [93]:

data_frame_m = data_frame_m.drop("BuildingArea",axis=1)
data_frame_m = data_frame_m.drop("YearBuilt",axis=1)
data_frame_m = data_frame_m.drop("CouncilArea",axis=1)

round(100*(data_frame_m.isnull().sum()/len(data_frame_m.index)),2)


Suburb            0.00
Address           0.00
Rooms             0.00
Type              0.00
Price            21.88
Method            0.00
SellerG           0.00
Date              0.00
Distance          0.00
Postcode          0.00
Bedroom2         19.03
Bathroom         19.04
Car              19.65
Landsize         26.06
Lattitude        18.28
Longtitude       18.28
Regionname        0.00
Propertycount     0.00
dtype: float64

In [97]:
data_frame_m[data_frame_m.isnull().sum(axis=1)>5]

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,Lattitude,Longtitude,Regionname,Propertycount
15,Abbotsford,217 Langridge St,3,h,1000000.0,S,Jellis,08-10-2016,2.5,3067.0,,,,,,,Northern Metropolitan,4019.0
16,Abbotsford,18a Mollison St,2,t,745000.0,S,Jellis,08-10-2016,2.5,3067.0,,,,,,,Northern Metropolitan,4019.0
19,Abbotsford,403/609 Victoria St,2,u,542000.0,S,Dingle,08-10-2016,2.5,3067.0,,,,,,,Northern Metropolitan,4019.0
21,Abbotsford,25/84 Trenerry Cr,2,u,760000.0,SP,Biggin,10-12-2016,2.5,3067.0,,,,,,,Northern Metropolitan,4019.0
22,Abbotsford,106/119 Turner St,1,u,481000.0,SP,Purplebricks,10-12-2016,2.5,3067.0,,,,,,,Northern Metropolitan,4019.0
27,Abbotsford,13/84 Trenerry Cr,1,u,500000.0,S,Biggin,12-11-2016,2.5,3067.0,,,,,,,Northern Metropolitan,4019.0
33,Abbotsford,250 Langridge St,2,t,847000.0,S,Jellis,16-07-2016,2.5,3067.0,,,,,,,Northern Metropolitan,4019.0
34,Abbotsford,16b Mollison St,2,h,,PI,Biggin,16-07-2016,2.5,3067.0,,,,,,,Northern Metropolitan,4019.0
45,Abbotsford,65/80 Trenerry Cr,1,u,480000.0,S,Biggin,19-11-2016,2.5,3067.0,,,,,,,Northern Metropolitan,4019.0
46,Abbotsford,119/52 Nicholson St,1,u,423500.0,S,hockingstuart,22-05-2016,2.5,3067.0,,,,,,,Northern Metropolitan,4019.0


In [103]:
#Count the number of rows having > 5 missing values
# use len(df.index)
len(data_frame_m[data_frame_m.isnull().sum(axis=1)>5].index)

4278

In [108]:
# Calculate the percentage of such rows
100*(len(data_frame_m[data_frame_m.isnull().sum(axis=1)>5].index)/len(data_frame_m.index))

18.16791948018856

In [114]:
# Retain the rows having at most 5 NaNs
data_framem = data_frame_m[data_frame_m.isnull().sum(axis=1) <=5]
round(100*(data_framem.isnull().sum()/len(data_framem.index)), 2)

Suburb            0.00
Address           0.00
Rooms             0.00
Type              0.00
Price            21.71
Method            0.00
SellerG           0.00
Date              0.00
Distance          0.00
Postcode          0.00
Bedroom2          1.05
Bathroom          1.07
Car               1.81
Landsize          9.65
Lattitude         0.13
Longtitude        0.13
Regionname        0.00
Propertycount     0.00
dtype: float64
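<p>Filtering rows by their count of missing values can also be expressed with the <code>thresh</code> parameter of dropna(), which states the minimum number of non-missing values a row must have to be kept. The sketch below uses hypothetical data and a hypothetical limit of 1 missing value per row to show that both forms keep the same rows.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 0 has 0 missing values, row 1 has 3, row 2 has 1
df = pd.DataFrame({
    "a": [1, np.nan, np.nan],
    "b": [1, np.nan, 2],
    "c": [1, np.nan, 3],
})

max_missing = 1  # assumed limit for this illustration

# Boolean-mask form, as used in this notebook
kept_mask = df[df.isnull().sum(axis=1) <= max_missing]

# Equivalent dropna form: at most max_missing NaNs per row means
# at least (number of columns - max_missing) non-missing values
kept_thresh = df.dropna(thresh=df.shape[1] - max_missing)

print(kept_mask)
print(kept_thresh)
```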

In [116]:
data_framem

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,68 Studley St,2,h,,SS,Jellis,03-09-2016,2.5,3067.0,2.0,1.0,1.0,126.0,-37.80140,144.99580,Northern Metropolitan,4019.0
1,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,03-12-2016,2.5,3067.0,2.0,1.0,1.0,202.0,-37.79960,144.99840,Northern Metropolitan,4019.0
2,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,04-02-2016,2.5,3067.0,2.0,1.0,0.0,156.0,-37.80790,144.99340,Northern Metropolitan,4019.0
3,Abbotsford,18/659 Victoria St,3,u,,VB,Rounds,04-02-2016,2.5,3067.0,3.0,2.0,1.0,0.0,-37.81140,145.01160,Northern Metropolitan,4019.0
4,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,04-03-2017,2.5,3067.0,3.0,2.0,0.0,134.0,-37.80930,144.99440,Northern Metropolitan,4019.0
5,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,04-03-2017,2.5,3067.0,3.0,2.0,1.0,94.0,-37.79690,144.99690,Northern Metropolitan,4019.0
6,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,04-06-2016,2.5,3067.0,3.0,1.0,2.0,120.0,-37.80720,144.99410,Northern Metropolitan,4019.0
7,Abbotsford,16 Maugie St,4,h,,SN,Nelson,06-08-2016,2.5,3067.0,3.0,2.0,2.0,400.0,-37.79650,144.99650,Northern Metropolitan,4019.0
8,Abbotsford,53 Turner St,2,h,,S,Biggin,06-08-2016,2.5,3067.0,4.0,1.0,2.0,201.0,-37.79950,144.99740,Northern Metropolitan,4019.0
9,Abbotsford,99 Turner St,2,h,,S,Collins,06-08-2016,2.5,3067.0,3.0,2.0,1.0,202.0,-37.79960,144.99890,Northern Metropolitan,4019.0


In [118]:
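# Keep only the rows where Price is present
data_framem_noprice = data_framem[~np.isnan(data_framem["Price"])]
# Note: this displays the unfiltered data_framem, so rows with a missing Price still appear below
data_framem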

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,68 Studley St,2,h,,SS,Jellis,03-09-2016,2.5,3067.0,2.0,1.0,1.0,126.0,-37.80140,144.99580,Northern Metropolitan,4019.0
1,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,03-12-2016,2.5,3067.0,2.0,1.0,1.0,202.0,-37.79960,144.99840,Northern Metropolitan,4019.0
2,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,04-02-2016,2.5,3067.0,2.0,1.0,0.0,156.0,-37.80790,144.99340,Northern Metropolitan,4019.0
3,Abbotsford,18/659 Victoria St,3,u,,VB,Rounds,04-02-2016,2.5,3067.0,3.0,2.0,1.0,0.0,-37.81140,145.01160,Northern Metropolitan,4019.0
4,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,04-03-2017,2.5,3067.0,3.0,2.0,0.0,134.0,-37.80930,144.99440,Northern Metropolitan,4019.0
5,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,04-03-2017,2.5,3067.0,3.0,2.0,1.0,94.0,-37.79690,144.99690,Northern Metropolitan,4019.0
6,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,04-06-2016,2.5,3067.0,3.0,1.0,2.0,120.0,-37.80720,144.99410,Northern Metropolitan,4019.0
7,Abbotsford,16 Maugie St,4,h,,SN,Nelson,06-08-2016,2.5,3067.0,3.0,2.0,2.0,400.0,-37.79650,144.99650,Northern Metropolitan,4019.0
8,Abbotsford,53 Turner St,2,h,,S,Biggin,06-08-2016,2.5,3067.0,4.0,1.0,2.0,201.0,-37.79950,144.99740,Northern Metropolitan,4019.0
9,Abbotsford,99 Turner St,2,h,,S,Collins,06-08-2016,2.5,3067.0,3.0,2.0,1.0,202.0,-37.79960,144.99890,Northern Metropolitan,4019.0


In [125]:
# Drop the rows where Price is NaN
data_framem_noprice = data_framem[~np.isnan(data_framem["Price"])]
round(100*(data_framem_noprice.isnull().sum()/len(data_framem_noprice.index)), 2)

Suburb           0.00
Address          0.00
Rooms            0.00
Type             0.00
Price            0.00
Method           0.00
SellerG          0.00
Date             0.00
Distance         0.00
Postcode         0.00
Bedroom2         1.05
Bathroom         1.07
Car              1.76
Landsize         9.83
Lattitude        0.15
Longtitude       0.15
Regionname       0.00
Propertycount    0.00
dtype: float64
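<p>Dropping the rows where one particular column is missing can also be written with the <code>subset</code> parameter of dropna(). The sketch below, on hypothetical data, shows that it keeps the same rows as the <code>~np.isnan(...)</code> mask. One practical difference: np.isnan works only on numeric columns, whereas dropna() also handles text columns.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing Price
df = pd.DataFrame({"Price": [100.0, np.nan, 300.0], "Suburb": ["A", "B", "C"]})

# Mask form, as used in this notebook
via_isnan = df[~np.isnan(df["Price"])]

# Alternative: drop rows only when the listed columns are missing
via_dropna = df.dropna(subset=["Price"])

print(via_isnan)
print(via_dropna)
```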

In [128]:
data_framem_noprice["Landsize"].describe()

count     13603.000000
mean        558.116371
std        3987.326586
min           0.000000
25%         176.500000
50%         440.000000
75%         651.000000
max      433014.000000
Name: Landsize, dtype: float64

In [130]:
# Drop the rows where Landsize is NaN
data_framem_nolandsize = data_framem_noprice[~np.isnan(data_framem_noprice["Landsize"])]
round(100*(data_framem_nolandsize.isnull().sum()/len(data_framem_nolandsize.index)),2)

Suburb           0.00
Address          0.00
Rooms            0.00
Type             0.00
Price            0.00
Method           0.00
SellerG          0.00
Date             0.00
Distance         0.00
Postcode         0.00
Bedroom2         0.00
Bathroom         0.01
Car              0.46
Landsize         0.00
Lattitude        0.16
Longtitude       0.16
Regionname       0.00
Propertycount    0.00
dtype: float64

In [131]:
data_framem_nolandsize

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,Lattitude,Longtitude,Regionname,Propertycount
1,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,03-12-2016,2.5,3067.0,2.0,1.0,1.0,202.0,-37.79960,144.99840,Northern Metropolitan,4019.0
2,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,04-02-2016,2.5,3067.0,2.0,1.0,0.0,156.0,-37.80790,144.99340,Northern Metropolitan,4019.0
4,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,04-03-2017,2.5,3067.0,3.0,2.0,0.0,134.0,-37.80930,144.99440,Northern Metropolitan,4019.0
5,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,04-03-2017,2.5,3067.0,3.0,2.0,1.0,94.0,-37.79690,144.99690,Northern Metropolitan,4019.0
6,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,04-06-2016,2.5,3067.0,3.0,1.0,2.0,120.0,-37.80720,144.99410,Northern Metropolitan,4019.0
10,Abbotsford,129 Charles St,2,h,941000.0,S,Jellis,07-05-2016,2.5,3067.0,2.0,1.0,0.0,181.0,-37.80410,144.99530,Northern Metropolitan,4019.0
11,Abbotsford,124 Yarra St,3,h,1876000.0,S,Nelson,07-05-2016,2.5,3067.0,4.0,2.0,0.0,245.0,-37.80240,144.99930,Northern Metropolitan,4019.0
14,Abbotsford,98 Charles St,2,h,1636000.0,S,Nelson,08-10-2016,2.5,3067.0,2.0,1.0,2.0,256.0,-37.80600,144.99540,Northern Metropolitan,4019.0
17,Abbotsford,6/241 Nicholson St,1,u,300000.0,S,Biggin,08-10-2016,2.5,3067.0,1.0,1.0,1.0,0.0,-37.80080,144.99730,Northern Metropolitan,4019.0
18,Abbotsford,10 Valiant St,2,h,1097000.0,S,Biggin,08-10-2016,2.5,3067.0,3.0,1.0,2.0,220.0,-37.80100,144.99890,Northern Metropolitan,4019.0


In [137]:
# The remaining missing values in latitude and longitude can now be imputed easily.
data_framem_nolandsize.loc[:,['Lattitude','Longtitude']].describe()



Unnamed: 0,Lattitude,Longtitude
count,13581.0,13581.0
mean,-37.809204,144.995221
std,0.079257,0.103913
min,-38.18255,144.43181
25%,-37.85682,144.9296
50%,-37.80236,145.0001
75%,-37.7564,145.05832
max,-37.40853,145.52635


In [141]:
# Latitude and longitude each vary within a very narrow range (small standard deviation), so we can impute the missing values with the column mean

data_framem_nolandsize.loc[np.isnan(data_framem_nolandsize['Lattitude']),['Lattitude']] = data_framem_nolandsize['Lattitude'].mean()
data_framem_nolandsize.loc[np.isnan(data_framem_nolandsize['Longtitude']),['Longtitude']] = data_framem_nolandsize['Longtitude'].mean()

round(100*(data_framem_nolandsize.isnull().sum()/len(data_framem_nolandsize.index)), 2)


Suburb           0.00
Address          0.00
Rooms            0.00
Type             0.00
Price            0.00
Method           0.00
SellerG          0.00
Date             0.00
Distance         0.00
Postcode         0.00
Bedroom2         0.00
Bathroom         0.01
Car              0.46
Landsize         0.00
Lattitude        0.00
Longtitude       0.00
Regionname       0.00
Propertycount    0.00
dtype: float64
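<p>Mean imputation for several columns at once can be sketched with fillna(), which accepts a Series of per-column fill values. The coordinates below are made up; only the column names follow the dataset's spelling.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy coordinates with one missing value in each column
df = pd.DataFrame({
    "Lattitude": [-37.80, np.nan, -37.82],
    "Longtitude": [144.99, 145.01, np.nan],
})

# Column means, computed while ignoring NaN
means = df[["Lattitude", "Longtitude"]].mean()

# fillna matches the Series index against the column names
filled = df.fillna(means)

print(filled)
```

<p>Because fillna() returns a new data frame, this form does not assign into a filtered slice.</p>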

In [144]:
# Inspect Car and Bathroom before imputing them
data_framem_nolandsize.loc[:,["Car","Bathroom"]].describe()


Unnamed: 0,Car,Bathroom
count,13540.0,13602.0
mean,1.610414,1.534921
std,0.962244,0.691834
min,0.0,0.0
25%,1.0,1.0
50%,2.0,1.0
75%,2.0,2.0
max,10.0,8.0


In [146]:
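# Converting the car parks to category
# (the SettingWithCopyWarning below appears because data_framem_nolandsize was created by filtering another dataframe)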
data_framem_nolandsize['Car']=data_framem_nolandsize['Car'].astype('category')
data_framem_nolandsize['Car'].value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


2.0     5606
1.0     5515
0.0     1026
3.0      748
4.0      507
5.0       63
6.0       54
8.0        9
7.0        8
10.0       3
9.0        1
Name: Car, dtype: int64
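<p>The SettingWithCopyWarning shown above typically appears when a column is assigned on a data frame that was itself obtained by filtering another one. A common way to avoid it, sketched below on hypothetical data, is to take an explicit copy right after filtering.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"Price": [1.0, np.nan, 3.0], "Car": [1.0, 2.0, np.nan]})

# .copy() makes the filtered result an independent data frame,
# so later column assignments do not refer back to df
subset = df[df["Price"].notnull()].copy()
subset["Car"] = subset["Car"].astype("category")

print(subset.dtypes)
print(df.dtypes)
```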

In [149]:
# Impute the missing Car values with 2, the most frequent value
data_framem_nolandsize.loc[pd.isnull(data_framem_nolandsize['Car']),['Car']]=2
round(100*(data_framem_nolandsize.isnull().sum()/len(data_framem_nolandsize.index)), 2)

Suburb           0.00
Address          0.00
Rooms            0.00
Type             0.00
Price            0.00
Method           0.00
SellerG          0.00
Date             0.00
Distance         0.00
Postcode         0.00
Bedroom2         0.00
Bathroom         0.01
Car              0.00
Landsize         0.00
Lattitude        0.00
Longtitude       0.00
Regionname       0.00
Propertycount    0.00
dtype: float64

In [150]:
# With the above the only column that is left to deal with is bathroom
data_framem_nolandsize['Bathroom']=data_framem_nolandsize['Bathroom'].astype('category')
data_framem_nolandsize['Bathroom'].value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


1.0    7517
2.0    4987
3.0     921
4.0     106
0.0      34
5.0      28
6.0       5
8.0       2
7.0       2
Name: Bathroom, dtype: int64

In [152]:
# We can impute the missing Bathroom values with 1, as the majority of properties have 1 bathroom

data_framem_nolandsize.loc[pd.isnull(data_framem_nolandsize['Bathroom']),['Bathroom']] =1
round(100*(data_framem_nolandsize.isnull().sum()/len(data_framem_nolandsize.index)), 2)

Suburb           0.0
Address          0.0
Rooms            0.0
Type             0.0
Price            0.0
Method           0.0
SellerG          0.0
Date             0.0
Distance         0.0
Postcode         0.0
Bedroom2         0.0
Bathroom         0.0
Car              0.0
Landsize         0.0
Lattitude        0.0
Longtitude       0.0
Regionname       0.0
Propertycount    0.0
dtype: float64
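<p>The Car and Bathroom imputations above hard-code the most frequent value read off value_counts(). A sketch of the same idea that looks up the mode programmatically, on hypothetical data:</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the mode of Car is 2.0 and the mode of Bathroom is 1.0
df = pd.DataFrame({
    "Car": [2.0, 1.0, 2.0, np.nan],
    "Bathroom": [1.0, 1.0, np.nan, 2.0],
})

for col in ["Car", "Bathroom"]:
    most_common = df[col].mode()[0]        # mode() ignores NaN; [0] takes the first mode
    df[col] = df[col].fillna(most_common)

print(df)
```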

In [155]:
data_framem_nolandsize.shape

(13603, 18)

In [158]:
(len(data_framem_nolandsize.index)/23547)*100
# About 58% of the rows are retained (around 42% lost), but it is still good enough data for the ML algorithm

57.769567248481756

In [161]:
import pandas as pd
df_test_cv = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
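# Bug: the closing parenthesis of round() is misplaced, so the fractions are rounded to whole numbers
# and 2 is passed to print() as a second argument (hence the trailing "2" in the output)
print(round(df_test_cv.isnull().sum()/len(df_test_cv.index)),2)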

Ord_id                 0.0
Prod_id                0.0
Ship_id                0.0
Cust_id                0.0
Sales                  0.0
Discount               0.0
Order_Quantity         0.0
Profit                 0.0
Shipping_Cost          0.0
Product_Base_Margin    0.0
dtype: float64 2


In [167]:
print(round(100*(df_test_cv.isnull().sum()/len(df_test_cv.index)),2))

Ord_id                 0.00
Prod_id                0.00
Ship_id                0.00
Cust_id                0.00
Sales                  0.24
Discount               0.65
Order_Quantity         0.65
Profit                 0.65
Shipping_Cost          0.65
Product_Base_Margin    1.30
dtype: float64
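<p>The difference between the two cells above comes down to where the closing parenthesis of round() sits and whether the fraction is multiplied by 100. A plain-Python sketch with a hypothetical missing fraction of 0.0065:</p>

```python
# Hypothetical fraction of missing values in one column
fraction_missing = 0.0065

# Misplaced parenthesis: round() gets one argument and rounds to a whole number;
# the 2 becomes a separate value
wrong = round(fraction_missing), 2

# Intended form: convert to a percentage, then round to 2 decimal places
right = round(100 * fraction_missing, 2)

print(wrong)   # (0, 2)
print(right)   # 0.65
```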


In [168]:
df_test_cv

Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin
0,Ord_5446,Prod_16,SHP_7609,Cust_1818,136.8100,0.01,23.0,-30.51,3.60,0.56
1,Ord_5406,Prod_13,SHP_7549,Cust_1818,42.2700,0.01,13.0,4.56,0.93,0.54
2,Ord_5446,Prod_4,SHP_7610,Cust_1818,4701.6900,0.00,26.0,1148.90,2.50,0.59
3,Ord_5456,Prod_6,SHP_7625,Cust_1818,2337.8900,0.09,43.0,729.34,14.30,0.37
4,Ord_5485,Prod_17,SHP_7664,Cust_1818,4233.1500,0.08,35.0,1219.87,26.30,0.38
5,Ord_5446,Prod_6,SHP_7608,Cust_1818,164.0200,0.03,23.0,-47.64,6.15,0.37
6,Ord_31,Prod_12,SHP_41,Cust_26,14.7600,0.01,5.0,1.32,0.50,0.36
7,Ord_4725,Prod_4,SHP_6593,Cust_1641,3410.1575,0.10,48.0,1137.91,0.99,0.55
8,Ord_4725,Prod_13,SHP_6593,Cust_1641,162.0000,0.01,33.0,45.84,0.71,0.52
9,Ord_4725,Prod_6,SHP_6593,Cust_1641,57.2200,0.07,8.0,-27.72,6.60,0.37


In [192]:
# Keep the rows having fewer than 5 missing values, then display the result
df_test_refined = df_test_cv[df_test_cv.isnull().sum(axis=1) < 5]
df_test_refined


Unnamed: 0,Ord_id,Prod_id,Ship_id,Cust_id,Sales,Discount,Order_Quantity,Profit,Shipping_Cost,Product_Base_Margin
0,Ord_5446,Prod_16,SHP_7609,Cust_1818,136.8100,0.01,23.0,-30.51,3.60,0.56
1,Ord_5406,Prod_13,SHP_7549,Cust_1818,42.2700,0.01,13.0,4.56,0.93,0.54
2,Ord_5446,Prod_4,SHP_7610,Cust_1818,4701.6900,0.00,26.0,1148.90,2.50,0.59
3,Ord_5456,Prod_6,SHP_7625,Cust_1818,2337.8900,0.09,43.0,729.34,14.30,0.37
4,Ord_5485,Prod_17,SHP_7664,Cust_1818,4233.1500,0.08,35.0,1219.87,26.30,0.38
5,Ord_5446,Prod_6,SHP_7608,Cust_1818,164.0200,0.03,23.0,-47.64,6.15,0.37
6,Ord_31,Prod_12,SHP_41,Cust_26,14.7600,0.01,5.0,1.32,0.50,0.36
7,Ord_4725,Prod_4,SHP_6593,Cust_1641,3410.1575,0.10,48.0,1137.91,0.99,0.55
8,Ord_4725,Prod_13,SHP_6593,Cust_1641,162.0000,0.01,33.0,45.84,0.71,0.52
9,Ord_4725,Prod_6,SHP_6593,Cust_1641,57.2200,0.07,8.0,-27.72,6.60,0.37


In [197]:
df_test_refined = df_test_cv.loc[df_test_cv.isnull().sum(axis=1) <=5]
round(100*(df_test_refined.isnull().sum()/len(df_test_refined.index)), 2)

Ord_id                 0.00
Prod_id                0.00
Ship_id                0.00
Cust_id                0.00
Sales                  0.00
Discount               0.42
Order_Quantity         0.42
Profit                 0.42
Shipping_Cost          0.42
Product_Base_Margin    1.06
dtype: float64

In [198]:
import pandas as pd
df_test_refined = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
df_test_refined = df_test_refined[df_test_refined.isnull().sum(axis=1) <=5]
print(round(100*(df_test_refined.isnull().sum()/len(df_test_refined.index)), 2))#Round off to 2 decimal places.

Ord_id                 0.00
Prod_id                0.00
Ship_id                0.00
Cust_id                0.00
Sales                  0.00
Discount               0.42
Order_Quantity         0.42
Profit                 0.42
Shipping_Cost          0.42
Product_Base_Margin    1.06
dtype: float64


In [203]:
import pandas as pd
import numpy as np

weather_dateframe=pd.read_csv("D:/upgrad/weather-data-in-new-york-city-2016/weather_data_nyc_centralpark_2016.csv")
weather_dateframe.columns


Index(['date', 'maximum temperature', 'minimum temperature',
       'average temperature', 'precipitation', 'snow fall', 'snow depth'],
      dtype='object')

In [207]:
weather_dateframe.max()

date                   9-9-2016
maximum temperature          96
minimum temperature          81
average temperature        88.5
precipitation                 T
snow fall                     T
snow depth                    T
dtype: object

In [210]:
weather_dateframe.shape

(366, 7)

In [215]:
rows, columns = weather_dateframe.shape
rows, columns # Prints a tuple

(366, 7)

In [217]:
weather_dateframe.head()

Unnamed: 0,date,maximum temperature,minimum temperature,average temperature,precipitation,snow fall,snow depth
0,1-1-2016,42,34,38.0,0.0,0.0,0
1,2-1-2016,40,32,36.0,0.0,0.0,0
2,3-1-2016,45,35,40.0,0.0,0.0,0
3,4-1-2016,36,14,25.0,0.0,0.0,0
4,5-1-2016,29,11,20.0,0.0,0.0,0


In [220]:
# Select the date and snow fall columns together by passing a list of column names
weather_dateframe[['date', 'snow fall']]

In [223]:
weather_dateframe[:3]

Unnamed: 0,date,maximum temperature,minimum temperature,average temperature,precipitation,snow fall,snow depth
0,1-1-2016,42,34,38.0,0.0,0.0,0
1,2-1-2016,40,32,36.0,0.0,0.0,0
2,3-1-2016,45,35,40.0,0.0,0.0,0


In [252]:
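# Find the dates where the snowfall is maximum
# Note: 'snow fall' holds strings, so max() compares them lexicographically and returns 'T', not the largest numeric snowfall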
weather_dateframe[['date','snow fall']][weather_dateframe['snow fall']==weather_dateframe['snow fall'].max()]

Unnamed: 0,date,snow fall
11,12-1-2016,T
13,14-1-2016,T
17,18-1-2016,T
23,24-1-2016,T
39,9-2-2016,T
40,10-2-2016,T
41,11-2-2016,T
53,23-2-2016,T
79,20-3-2016,T
93,3-4-2016,T
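<p>To find the numerically largest snowfall, the text column would first need converting to numbers. The sketch below uses hypothetical data and assumes that 'T' marks a trace amount that can be treated as missing; whether that (or replacing it with 0) is appropriate is a judgment call for the analyst.</p>

```python
import pandas as pd

# Hypothetical toy data mimicking the 'snow fall' column stored as text
df = pd.DataFrame({
    "date": ["1-1-2016", "2-1-2016", "3-1-2016"],
    "snow fall": ["0.0", "T", "2.5"],
})

# errors="coerce" turns values that cannot be parsed (here 'T') into NaN
df["snow_numeric"] = pd.to_numeric(df["snow fall"], errors="coerce")

string_max = df["snow fall"].max()       # lexicographic maximum of the strings
numeric_max = df["snow_numeric"].max()   # numeric maximum, ignoring NaN

top_dates = df.loc[df["snow_numeric"] == numeric_max, "date"].tolist()

print(string_max)
print(numeric_max)
print(top_dates)
```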


In [281]:
# Store the matching rows in their own dataframe so that they can be indexed by date
weather_dateframe_snow = weather_dateframe.loc[weather_dateframe['snow fall']==weather_dateframe['snow fall'].max(), ['date','snow fall']]
weather_dateframe_snow

Unnamed: 0,date,snow fall
11,12-1-2016,T
13,14-1-2016,T
17,18-1-2016,T
23,24-1-2016,T
39,9-2-2016,T
40,10-2-2016,T
41,11-2-2016,T
53,23-2-2016,T
79,20-3-2016,T
93,3-4-2016,T


In [280]:
weather_dateframe_snow.set_index('date', inplace=True)
weather_dateframe_snow.loc['9-2-2016']

snow fall    T
Name: 9-2-2016, dtype: object

In [287]:
for i in range(10, 16):
    # Check the combined case first, then each divisor on its own
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
    

Buzz
11
Fizz
13
14
FizzBuzz


In [299]:
from collections import defaultdict

def firstNonRepeating(arr, n):
    mp=defaultdict(lambda:0)
    for i in range(n):
        mp[arr[i]]+= 1
        
    for i in range(n):
        if(mp[arr[i]] == 1):
            return arr[i]
        
    return -1
    

arr_test = [2, 2, 7, 7, 1, 1, 3, 9, 9]
print(type(arr_test))
n= len(arr_test)
print(n)

print(firstNonRepeating(arr_test,n))




<class 'list'>
9
3
