### Detecting and removing outliers with the *percentile* and *quantile* functions
- Outliers are easy to spot when you have domain knowledge. Suppose you know the apartment rents in your area: when you see a dataset of prices for that area, the outliers stand out. In my area, for example, the rent for a 3BHK apartment is never more than 30k or less than 15k.
- What if you have no domain knowledge? First look at the mean, the 75th percentile, and the maximum in the EDD (Extended Data Dictionary). If the maximum of a column is unreasonably high, use the quantile function to get the 99% and 1% values and treat them as the maximum and minimum thresholds. Then find the values beyond those thresholds and remove them.
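
The workflow for the no-domain-knowledge case can be sketched as follows. This is only a minimal illustration on made-up numbers: the `rent` column and its values are hypothetical and are not part of the datasets used later in this notebook.

```python
import pandas as pd

# Hypothetical monthly rents (in thousands); two entries are deliberately extreme.
rents = pd.DataFrame({"rent": [18, 20, 22, 25, 21, 19, 24, 23, 26, 27,
                               17, 28, 20, 22, 25, 21, 19, 24, 23, 300, 1]})

# Step 1: compare the mean, the 75th percentile and the maximum.
summary = rents["rent"].describe()
print(summary[["mean", "75%", "max"]])

# Step 2: take the 1% and 99% quantiles as minimum and maximum thresholds.
low = rents["rent"].quantile(0.01)
high = rents["rent"].quantile(0.99)

# Step 3: keep only the rows strictly inside the thresholds.
trimmed = rents[(rents["rent"] > low) & (rents["rent"] < high)]
print(len(rents), "->", len(trimmed))
```

On this toy data the two extreme entries (1 and 300) fall outside the thresholds and are dropped.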

# Using the *quantile* function:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import matplotlib
matplotlib.rc('figure',figsize=(10,5))
plt.style.use('ggplot')

In [2]:
df= pd.read_csv('height.csv')

In [3]:
df

Unnamed: 0,name,height
0,mohan,5.9
1,maria,5.2
2,sakib,5.1
3,tao,5.5
4,virat,4.9
5,khusbu,5.4
6,dmitry,6.2
7,selena,6.5
8,john,7.1
9,imran,14.5


In [4]:
df['height']

0      5.9
1      5.2
2      5.1
3      5.5
4      4.9
5      5.4
6      6.2
7      6.5
8      7.1
9     14.5
10     6.1
11     5.6
12     1.2
13     5.0
Name: height, dtype: float64

In [7]:
max_threshold = df['height'].quantile(.95)
# The maximum threshold is the largest value we accept as realistic.
# Here it is the 95th percentile of the column:
# values above it are treated as outliers.
max_threshold

9.689999999999998
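
The threshold of about 9.69 is not one of the observed heights. By default, `Series.quantile` interpolates linearly between the two nearest sorted values. The sketch below reproduces the number by hand. It assumes the 14 heights printed in the `df['height']` output, typed in directly so that the example does not depend on `height.csv`.

```python
import math
import pandas as pd

# The 14 heights shown in the df['height'] output above.
heights = pd.Series([5.9, 5.2, 5.1, 5.5, 4.9, 5.4, 6.2, 6.5, 7.1, 14.5,
                     6.1, 5.6, 1.2, 5.0])

q = 0.95
ordered = sorted(heights)
pos = q * (len(ordered) - 1)    # fractional position: 0.95 * 13 = 12.35
lower = math.floor(pos)         # index 12, which holds 7.1
frac = pos - lower              # 0.35 of the way towards index 13 (14.5)
manual = ordered[lower] + frac * (ordered[lower + 1] - ordered[lower])

print(manual)                   # approximately 9.69
print(heights.quantile(q))      # pandas returns the same value
```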

In [8]:
# rows above the maximum threshold:
df[df.height>max_threshold]

Unnamed: 0,name,height
9,imran,14.5


imran's height is recorded as 14.5 ft, which is not possible for anyone, so it is a data error.

In [10]:
# We can also detect outliers at the low end.
# The minimum threshold is the smallest value we accept as realistic.
min_threshold = df.height.quantile(.05)
min_threshold

3.6050000000000004

In [11]:
df[df.height<min_threshold]

Unnamed: 0,name,height
12,yoseph,1.2


yoseph's height is recorded as 1.2, which is not possible for a human being.

In [17]:
# rows that remain after excluding both tails:
df[(df.height<max_threshold) & (df.height>min_threshold)]

Unnamed: 0,name,height
0,mohan,5.9
1,maria,5.2
2,sakib,5.1
3,tao,5.5
4,virat,4.9
5,khusbu,5.4
6,dmitry,6.2
7,selena,6.5
8,john,7.1
10,jose,6.1
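
The same two-sided filter can be wrapped in a small helper so that it can be reused on other columns. The function name `trim_by_quantile` and its parameters are hypothetical, introduced only for this sketch, and the heights are the values from the `df['height']` output typed in directly.

```python
import pandas as pd

def trim_by_quantile(frame, column, lower_q=0.05, upper_q=0.95):
    """Return the rows whose `column` value lies strictly between the two quantiles."""
    low = frame[column].quantile(lower_q)
    high = frame[column].quantile(upper_q)
    mask = (frame[column] > low) & (frame[column] < high)
    return frame[mask]

heights = pd.DataFrame({"height": [5.9, 5.2, 5.1, 5.5, 4.9, 5.4, 6.2, 6.5,
                                   7.1, 14.5, 6.1, 5.6, 1.2, 5.0]})

kept = trim_by_quantile(heights, "height")
print(kept["height"].min(), kept["height"].max())
```

The comparison is strict on both sides, matching the `<` and `>` used in the cell above, so a value exactly equal to a threshold would also be dropped.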


# BANGALORE PROPERTY PRICE

In [18]:
df2 = pd.read_csv('bangalore_property_price.csv')

In [19]:
df2.head()

Unnamed: 0,location,size,total_sqft,bath,price,bhk,price_per_sqft
0,Electronic City Phase II,2 BHK,1056.0,2.0,39.07,2,3699
1,Chikka Tirupathi,4 Bedroom,2600.0,5.0,120.0,4,4615
2,Uttarahalli,3 BHK,1440.0,2.0,62.0,3,4305
3,Lingadheeranahalli,3 BHK,1521.0,3.0,95.0,3,6245
4,Kothanur,2 BHK,1200.0,2.0,51.0,2,4250


In [20]:
df2.columns

Index(['location', 'size', 'total_sqft', 'bath', 'price', 'bhk',
       'price_per_sqft'],
      dtype='object')

In [21]:
df2.shape

(13200, 7)

In [27]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13200 entries, 0 to 13199
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   location        13200 non-null  object 
 1   size            13200 non-null  object 
 2   total_sqft      13200 non-null  float64
 3   bath            13200 non-null  float64
 4   price           13200 non-null  float64
 5   bhk             13200 non-null  int64  
 6   price_per_sqft  13200 non-null  int64  
dtypes: float64(3), int64(2), object(2)
memory usage: 722.0+ KB


In [24]:
df2.describe()

Unnamed: 0,total_sqft,bath,price,bhk,price_per_sqft
count,13200.0,13200.0,13200.0,13200.0,13200.0
mean,1555.302783,2.691136,112.276178,2.800833,7920.337
std,1237.323445,1.338915,149.175995,1.292843,106727.2
min,1.0,1.0,8.0,1.0,267.0
25%,1100.0,2.0,50.0,2.0,4267.0
50%,1275.0,2.0,71.85,3.0,5438.0
75%,1672.0,3.0,120.0,3.0,7317.0
max,52272.0,40.0,3600.0,43.0,12000000.0


### PRIMARY OBSERVATION:
- The maximum price per square foot is 12,000,000 rupees (1 crore 20 lakh rupees), while the mean is about 7,920 rupees per sqft. That is not realistic, so we need to find threshold values.
- There are no missing values.
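
Both observations can also be checked programmatically. The sketch below uses a handful of `total_sqft` and `price_per_sqft` values copied from rows printed in this notebook rather than the full CSV, so the numbers it prints will differ from the `describe()` output above. The variable names `sample`, `missing` and `ratio` are only illustrative.

```python
import pandas as pd

# Six rows copied from the printed outputs (the last one is row 4044).
sample = pd.DataFrame({
    "total_sqft": [1056.0, 2600.0, 1440.0, 1521.0, 1200.0, 1.0],
    "price_per_sqft": [3699, 4615, 4305, 6245, 4250, 12000000],
})

# Count missing values per column; all zeros means nothing is missing.
missing = sample.isna().sum()
print(missing)

# A maximum far above the 75th percentile hints at outliers in that column.
stats = sample["price_per_sqft"].describe()
ratio = stats["max"] / stats["75%"]
print(f"max is {ratio:.0f}x the 75th percentile")
```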

In [32]:
maximum_threshold = df2.price_per_sqft.quantile(.999)
maximum_threshold 

50959.36200000098

In [33]:
minimum_threshold = df2.price_per_sqft.quantile(.001)
minimum_threshold

1366.184

In [35]:
# let's check the data points above the maximum threshold
df2[df2.price_per_sqft>maximum_threshold]

Unnamed: 0,location,size,total_sqft,bath,price,bhk,price_per_sqft
345,other,3 Bedroom,11.0,3.0,74.0,3,672727
1005,other,1 BHK,15.0,1.0,30.0,1,200000
1106,other,5 Bedroom,24.0,2.0,150.0,5,625000
4044,Sarjapur Road,4 Bedroom,1.0,4.0,120.0,4,12000000
4924,other,7 BHK,5.0,7.0,115.0,7,2300000
5911,Mysore Road,1 Bedroom,45.0,1.0,23.0,1,51111
6356,Bommenahalli,4 Bedroom,2940.0,3.0,2250.0,4,76530
7012,other,1 BHK,650.0,1.0,500.0,1,76923
7575,other,1 BHK,425.0,1.0,750.0,1,176470
7799,other,4 BHK,2000.0,3.0,1063.0,4,53150


#### A price of more than 6 lakh rupees per square foot is not plausible for a 3BHK.

In [36]:
# let's check the data points below the minimum threshold
df2[df2.price_per_sqft<minimum_threshold]

Unnamed: 0,location,size,total_sqft,bath,price,bhk,price_per_sqft
665,Yelahanka,3 BHK,35000.0,3.0,130.0,3,371
798,other,4 Bedroom,10961.0,4.0,80.0,4,729
1867,other,3 Bedroom,52272.0,2.0,140.0,3,267
2392,other,4 Bedroom,2000.0,3.0,25.0,4,1250
3934,other,1 BHK,1500.0,1.0,19.5,1,1300
5343,other,9 BHK,42000.0,8.0,175.0,9,416
5417,Ulsoor,4 BHK,36000.0,4.0,450.0,4,1250
5597,JP Nagar,2 BHK,1100.0,1.0,15.0,2,1363
7166,Yelahanka,1 Bedroom,26136.0,1.0,150.0,1,573
7862,JP Nagar,3 BHK,20000.0,3.0,175.0,3,875


#### A price of 371 rupees per sqft is not plausible in Bangalore.
#### All the values beyond the maximum and minimum thresholds are unacceptable, so we remove them before the final analysis.

In [42]:
# our new dataset for Bangalore house prices:
df3 = df2[(df2.price_per_sqft<maximum_threshold) & (df2.price_per_sqft>minimum_threshold)]

In [43]:
df3.shape

(13172, 7)

In [45]:
df3.sample(10)

Unnamed: 0,location,size,total_sqft,bath,price,bhk,price_per_sqft
5160,Devarachikkanahalli,2 BHK,1230.0,2.0,58.0,2,4715
2380,other,1 Bedroom,800.0,1.0,100.0,1,12500
1474,5th Phase JP Nagar,2 BHK,1207.0,2.0,63.0,2,5219
10687,other,3 BHK,1626.6,3.0,133.0,3,8176
6379,other,2 BHK,1150.0,2.0,48.0,2,4173
8374,Bannerghatta Road,4 BHK,3230.0,5.0,165.0,4,5108
7769,Whitefield,3 BHK,1991.0,3.0,104.0,3,5223
11125,Jalahalli,2 BHK,1244.0,2.0,88.0,2,7073
6069,Hebbal,3 BHK,1987.0,4.0,165.0,3,8303
4688,Koramangala,2 BHK,1320.0,2.0,150.0,2,11363


#### RESULT: after removing the outliers, the remaining values look reasonable.

# Using the *percentile* function:

In [72]:
#np.percentile(df2.bath,[70])[0]
#np.percentile(df2.bath,[80])[0]
#np.percentile(df2.bath,[90])[0]
#np.percentile(df2.bath,[95])[0]
np.percentile(df2.bath,[99])[0]

8.0

- 70% of apartments have at most 3 bathrooms.
- 80% of apartments have at most 3 bathrooms.
- 90% of apartments have at most 4 bathrooms.
- 95% of apartments have at most 5 bathrooms.
- 99% of apartments have at most 8 bathrooms.
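
`np.percentile` and `Series.quantile` compute the same statistic. The difference is the scale of `q`: NumPy takes it from 0 to 100, pandas from 0 to 1. A small sketch with hypothetical bathroom counts (not the Bangalore data) shows the equivalence and the "at most" reading of a percentile:

```python
import numpy as np
import pandas as pd

# Hypothetical bathroom counts for ten apartments.
baths = pd.Series([1, 2, 2, 2, 3, 3, 3, 4, 5, 9], dtype=float)

p90_numpy = np.percentile(baths, 90)   # q on a 0-100 scale
p90_pandas = baths.quantile(0.90)      # q on a 0-1 scale
print(p90_numpy, p90_pandas)

# Share of rows at or below the 90th percentile value.
share = (baths <= p90_numpy).mean()
print(share)
```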

In [80]:
maxim_threshold = np.percentile(df2.bath,[99])[0]

In [81]:
maxim_threshold

8.0

In [82]:
minim_threshold = np.percentile(df2.bath,[1])[0]

In [83]:
minim_threshold

1.0

In [84]:
# rows beyond the maximum or minimum threshold are treated as errors.
df2[df2.bath>maxim_threshold]

Unnamed: 0,location,size,total_sqft,bath,price,bhk,price_per_sqft
45,HSR Layout,8 Bedroom,600.0,9.0,200.0,8,33333
454,other,11 BHK,5000.0,9.0,360.0,11,7200
533,Mico Layout,9 BHK,5000.0,9.0,210.0,9,4200
760,other,9 Bedroom,600.0,9.0,190.0,9,31666
925,5th Phase JP Nagar,9 Bedroom,1260.0,11.0,290.0,9,23015
...,...,...,...,...,...,...,...
12948,other,10 Bedroom,7150.0,13.0,3600.0,10,50349
13022,other,9 BHK,4600.0,9.0,150.0,9,3260
13100,Laggere,7 Bedroom,1590.0,9.0,132.0,7,8301
13102,other,9 Bedroom,1178.0,9.0,75.0,9,6366


In [85]:
df2[df2.bath<minim_threshold]

Unnamed: 0,location,size,total_sqft,bath,price,bhk,price_per_sqft


## CONCLUSION:
1. A 99th-percentile cutoff from the *percentile* function is too coarse for this column: it gives 8 bathrooms, while the maximum is 40. Note that *percentile* takes `q` on a 0 to 100 scale, so a finer cutoff has to be written as 99.99 rather than .9999.
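
For reference, `np.percentile` expects `q` between 0 and 100, so a cutoff deeper in the tail is written as `99.99`, which matches `quantile(0.9999)` in pandas. The sketch below uses hypothetical bathroom counts (9,999 ordinary rows plus one extreme entry), not the Bangalore data:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 9,999 apartments with 2 bathrooms and one with 40.
baths = pd.Series([2.0] * 9999 + [40.0])

p99 = np.percentile(baths, 99)         # 99th percentile
p9999 = np.percentile(baths, 99.99)    # 99.99th percentile, q on a 0-100 scale
q9999 = baths.quantile(0.9999)         # same cutoff on pandas' 0-1 scale

print(p99, p9999, q9999)
print(baths[baths > p9999])            # only the 40-bathroom row is above the cutoff
```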


In [90]:
bathroom_number = df2.bath

In [98]:
count = bathroom_number[bathroom_number > 10].count()

In [99]:
count

20