# Outliers
![](https://miro.medium.com/max/2400/1*F_yiILIE954AZPgPADx76A.png)

Outliers are extreme values that deviate from the other observations in the data. They may indicate variability in a measurement, experimental errors, or a novelty. In other words, an outlier is an observation that diverges from the overall pattern of a sample.

# Types of outliers:

**1. Univariate**
Outliers can be of two kinds: univariate and multivariate. Univariate outliers can be found by looking at the distribution of values in a single feature space.

**2. Multivariate**
Multivariate outliers can be found in an n-dimensional space (of n features). Distributions in n-dimensional spaces are very difficult for the human brain to inspect, which is why we train a model to do it for us.
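
As a rough illustration of the difference, the sketch below builds two strongly correlated synthetic features. The data, the candidate point and the threshold of 3 are assumptions made for this example and are not part of the notebook's dataset. The candidate looks unremarkable in each feature on its own, but it breaks the correlation between them, which a Mahalanobis distance picks up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated synthetic features (illustrative data only)
x = rng.normal(size=500)
y = x + rng.normal(scale=0.1, size=500)
X = np.column_stack([x, y])

# Candidate point: each coordinate is about 2 standard deviations from the mean
candidate = np.array([2.0, -2.0])

# Univariate view: z-score of the candidate in each feature separately
z_per_feature = (candidate - X.mean(axis=0)) / X.std(axis=0)

# Multivariate view: the Mahalanobis distance uses the covariance between features
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = candidate - X.mean(axis=0)
mahalanobis = float(np.sqrt(diff @ cov_inv @ diff))

print("z-scores per feature:", z_per_feature)
print("Mahalanobis distance:", mahalanobis)
```

Neither per-feature z-score crosses the usual threshold of 3, while the Mahalanobis distance is far larger, so only the multivariate view flags the point.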

Outliers can also come in different flavours, depending on the environment: point outliers, contextual outliers, or collective outliers. 


**3. Point outliers**
Point outliers are single data points that lie far from the rest of the distribution.

**4. Contextual outliers**
Contextual outliers can be noise in data, such as punctuation symbols when performing text analysis or background noise when doing speech recognition.


 

# Most common causes of outliers in a data set:
* Data entry errors (human errors)
* Measurement errors (instrument errors)
* Experimental errors (data extraction or experiment planning/executing errors)
* Intentional (dummy outliers made to test detection methods)
* Data processing errors (data manipulation or data set unintended mutations)
* Sampling errors (extracting or mixing data from wrong or various sources)
* Natural (not an error, novelties in data)

# In this Notebook we will see:

**1. Outlier Detection Strategies for Different Distributions**

    1.1 For Normal or Gaussian Distribution

    1.2 For Non-Gaussian or Skewed Distributions
   
**2. How to Remove or Replace Outliers**

    2.1 Capping

    2.2 Trimming

    2.3 Treat outliers as a missing value

    2.4 Discretization or Binning
       
**3. Techniques to Handle Outliers**

Outlier models are based on:

Probabilistic and Statistical Modeling (parametric)

Linear Regression Models (PCA, LMS)

Proximity Based Models (non-parametric)

Information Theory Models

High Dimensional Outlier Detection Methods (high dimensional sparse data)

Here are a few algorithms we are going to discuss:

    3.1 Z-Score or Extreme Value Analysis (parametric)

    3.2 Isolation Forest

    3.3 DBSCAN
    
    3.4 Elliptic Envelope

    3.5 One-Class SVM
    
    3.6 Local Outlier Factor

    3.7 Boxplot (IQR based)

![](https://miro.medium.com/max/670/1*OIXCo35Vvzr9qoUbV2fHMA.png)

# 1. Outlier Detection Strategies for Different Distributions

# 1.1 For Normal or Gaussian Distribution
Identifying an observation as an outlier depends on the underlying distribution of the data. In this section, we limit the discussion to univariate data sets that are assumed to follow an approximately normal distribution. If the normality assumption for the data being tested is not valid, then a determination that there is an outlier may in fact be due to the non-normality of the data rather than the presence of an outlier.
For this reason, it is recommended that you generate a normal probability plot of the data before applying an outlier test. Although you can also perform formal tests for normality, the presence of one or more outliers may cause the tests to reject normality when it is in fact a reasonable assumption for applying the outlier test.

In addition to checking the normality assumption, the lower and upper tails of the normal probability plot can be a useful graphical technique for identifying potential outliers. In particular, the plot can help determine whether we need to check for a single outlier or whether we need to check for multiple outliers.

The box plot and the histogram can also be useful graphical tools in checking the normality assumption and in identifying potential outliers.
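
A minimal sketch of the normal probability plot idea, using `scipy.stats.probplot` on synthetic data (the sample, the planted value of 150 and the variable names are assumptions for illustration). It only computes the plot coordinates and the fit; passing `plot=plt` would draw the figure as well.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=200)   # synthetic, roughly normal data
sample_with_outlier = np.append(sample, 150.0)    # same data plus one extreme value

# probplot returns the plot coordinates and a least-squares fit (slope, intercept, r)
(_, _), (_, _, r_clean) = stats.probplot(sample, dist="norm")
(theoretical_q, ordered_values), (_, _, r_outlier) = stats.probplot(
    sample_with_outlier, dist="norm"
)

print("correlation with the normal line, clean data:  ", r_clean)
print("correlation with the normal line, with outlier:", r_outlier)
# The extreme value is the last ordered point, at the upper tail of the plot
print("largest ordered value:", ordered_values[-1])
```

A single extreme value pulls the correlation with the straight line down, and it shows up as the last point in the upper tail, which is where the text suggests looking.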

Use the empirical rule of the normal distribution.

* Data points that fall below mean - 3 × sigma or above mean + 3 × sigma are outliers,

where mean and sigma are the average value and standard deviation of a particular column.

![](https://miro.medium.com/max/679/0*y7kVHEQPQKBg3Cga.)

Good rule-of-thumb z-score thresholds (how to calculate the z-score is described in section 3.1) are 2.5, 3, 3.5 or more standard deviations.

# 1.2 For Non-Gaussian and Skewed Distributions

For a non-Gaussian distribution, one option is to transform the data towards a normal distribution; another is to use the Inter-Quartile Range (IQR) proximity rule.

* Skewed data points that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are outliers,

where Q1 and Q3 are the 25th and 75th percentiles of the dataset respectively, and IQR is the inter-quartile range, given by Q3 - Q1.
For other distributions we can use percentile values instead.

**For Skewed distributions:**

Use Inter-Quartile Range (IQR) proximity rule.

![](https://naysan.ca/wp-content/uploads/2020/06/box_plot_ref_needed.png)
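
The IQR proximity rule can be sketched in a few lines. The helper name `iqr_fences` and the small sample are made up for illustration.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Small made-up sample with one obviously large value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
lower, upper = iqr_fences(values)
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers)
```

Because the fences are built from quartiles, the extreme value itself barely influences them, which is why this rule suits skewed data better than the mean and standard deviation.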

**For other distributions:**

Use percentile-based approach.

![](https://acutecaretesting.org/-/media/acutecaretesting/articles/fig-6-example.jpg?h=402&w=750)
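
A minimal sketch of the percentile-based approach on synthetic skewed data. The 1st and 99th percentiles are an assumed choice here; the cut-offs should be picked to suit your own data.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.exponential(scale=2.0, size=1000)   # synthetic skewed data

# Treat everything outside the 1st-99th percentile band as an outlier
lower, upper = np.percentile(values, [1, 99])
is_outlier = (values < lower) | (values > upper)

# The flagged values can then be dropped, or capped at the limits as done here
capped = np.clip(values, lower, upper)
print("flagged:", int(is_outlier.sum()), "of", values.size)
```

Note that this approach always flags a fixed share of the data, whether or not those points are genuinely anomalous.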


# 2 How to Remove or Replace Outliers
# **2.1 Capping :** 
In this technique we set a lower and an upper limit. Every value below the lower limit or above the upper limit is treated as an outlier and is replaced by (capped at) the corresponding limit, so no rows are removed from the dataset.

![](https://miro.medium.com/max/1400/0*M4ZSmi8idYdsyzEp.png)

# **2.2 Trimming :** 
Trimming excludes the outlier values from the analysis. The dataset shrinks as a result, noticeably so when many outliers are present. Its main advantage is that it is fast.

![](https://i.pinimg.com/564x/20/0a/1a/200a1ab75f158986c52d3b59e3b1a501.jpg)


# **2.3 Treat outliers as a missing value :** 
By treating outliers as missing observations, we can handle them in the same way as missing values, for example by imputing them.

![](https://i.pinimg.com/564x/68/d0/cf/68d0cf2571b5a39eb3fca23d5bba8c70.jpg)

# **2.4 Discretization or Binning :** 
In this technique we group the values into bins, so that each outlier falls into a bin and is treated in the same way as the other points in that bin.
![](https://cxl.com/wp-content/uploads/2017/01/010211_dp_table_big.png)

Let's run some experiments.

In [None]:
# Assumption: the features are normally or approximately normally distributed.

# Step 1: import the necessary dependencies
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

In [None]:
#Step-2: Read and Load the Dataset
df = pd.read_csv('../input/placement-data-full-class/Placement_data_full_class.csv')
df.sample(5)
df.info()

In [None]:
df.describe()

In [None]:
# Step 3: plot the distribution of each numeric feature
# histplot with kde=True draws a histogram with a density curve on top
columns = ['degree_p', 'salary', 'hsc_p', 'ssc_p', 'mba_p', 'etest_p']
plt.figure(figsize=(16,5))
for position, column in enumerate(columns, start=1):
    plt.subplot(2,3,position)
    sns.histplot(df[column], kde=True)
plt.show()

In [None]:
import matplotlib.pyplot as plt
plt.scatter(df['degree_p'],df['salary'])
plt.xlim(0,100)
plt.ylim(0,1000000)
plt.show()

In [None]:
#Step-4: Finding the Boundary Values
Highest_allowed_degree=df['degree_p'].mean() + 3*df['degree_p'].std()
Lowest_allowed_degree=df['degree_p'].mean() - 3*df['degree_p'].std()
Highest_allowed_Salary=df['salary'].mean() + 3*df['salary'].std()
Lowest_allowed_Salary=df['salary'].mean() - 3*df['salary'].std()

print("Highest allowed degree:",Highest_allowed_degree)
print("Lowest allowed degree:",Lowest_allowed_degree)
print("Highest allowed salary:",Highest_allowed_Salary)
print("Lowest allowed salary:",Lowest_allowed_Salary)

In [None]:
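#Step-5: Finding the Outliers
# A row is an outlier if its salary or its degree_p lies outside the allowed range.
# A combined boolean mask is used because DataFrame.append is not available in pandas 2.x.
salary_outlier=(df['salary'] > Highest_allowed_Salary) | (df['salary'] < Lowest_allowed_Salary)
degree_outlier=(df['degree_p'] > Highest_allowed_degree) | (df['degree_p'] < Lowest_allowed_degree)
Outlier_data=df[salary_outlier | degree_outlier]
Outlier_data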

In [None]:
#plot outliers
import matplotlib.pyplot as plt
plt.scatter(df['degree_p'],df['salary'])
plt.scatter(Outlier_data['degree_p'],Outlier_data['salary'])
#change the axis limits according to your data
plt.xlim(0,100)
plt.ylim(0,1000000)
plt.show()

# **1. Capping Code**

In [None]:
# Cap values outside the allowed range at the boundary values.
# Series.clip avoids chained assignment and leaves missing values (NaN) untouched.
df_c=df.copy()
df_c['degree_p']=df_c['degree_p'].clip(lower=Lowest_allowed_degree, upper=Highest_allowed_degree)
df_c['salary']=df_c['salary'].clip(lower=Lowest_allowed_Salary, upper=Highest_allowed_Salary)

After capping, let's look at our data.

In [None]:
#first plot: the capped data
plt.scatter(df_c['degree_p'],df_c['salary'])
plt.xlim(0,100)
plt.ylim(0,1000000)
plt.show()
#second plot: the original data with the detected outliers highlighted
plt.scatter(df['degree_p'],df['salary'])
plt.scatter(Outlier_data['degree_p'],Outlier_data['salary'])
#change the axis limits according to your data
plt.xlim(0,100)
plt.ylim(0,1000000)
plt.show()

# **2. Trimming Code**

In [None]:
#apply a pandas DataFrame filter: keep only rows strictly inside the allowed range
#rows with a missing salary are dropped too, because comparisons with NaN are False
df_t=df.copy()
df_t=df_t[(df_t['salary'] < Highest_allowed_Salary) & (df_t['salary'] > Lowest_allowed_Salary)]
df_t=df_t[(df_t['degree_p'] < Highest_allowed_degree) & (df_t['degree_p'] > Lowest_allowed_degree)]
plt.scatter(df_t['degree_p'],df_t['salary'])
plt.xlim(0,100)
plt.ylim(0,1000000)
plt.show()
plt.scatter(df['degree_p'],df['salary'])
plt.scatter(Outlier_data['degree_p'],Outlier_data['salary'])
#change the axis limits according to your data
plt.xlim(0,100)
plt.ylim(0,1000000)
plt.show()


# 3. Treat outliers as a missing value Code:

In [None]:
df_n=df.copy()
# .loc selects rows and column in one step, avoiding unreliable chained assignment
df_n.loc[(df_n['salary'] >= Highest_allowed_Salary) | (df_n['salary'] <= Lowest_allowed_Salary),'salary']=np.nan
df_n.loc[(df_n['degree_p'] >= Highest_allowed_degree) | (df_n['degree_p'] <= Lowest_allowed_degree),'degree_p']=np.nan
df_n.describe()

Now you can impute the NaN values with an imputer. See the link below for how to impute missing values:
https://www.kaggle.com/mukulkirti/handle-missing-value
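
As one possible way to do this (a sketch on a toy column, not the approach from the linked notebook), scikit-learn's `SimpleImputer` can fill the gaps with the median of the remaining values. The toy values and the column name `salary_imputed` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column in which one outlier has already been replaced by NaN
toy = pd.DataFrame({"salary": [200000.0, 250000.0, np.nan, 300000.0, 260000.0]})

# Fill the gap with the median of the remaining values
imputer = SimpleImputer(strategy="median")
toy["salary_imputed"] = imputer.fit_transform(toy[["salary"]]).ravel()
print(toy)
```

The median is used rather than the mean because it is itself less sensitive to any extreme values that remain in the column.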

# 4. Discretization

In [None]:
from sklearn.preprocessing import KBinsDiscretizer
from matplotlib import pyplot

data=df['salary'].dropna().copy()

# histogram of the raw data
pyplot.hist(data, bins=100)
pyplot.show()

# reshape the data into a single column, since scikit-learn expects a 2-D array
data=np.asarray(data).reshape(-1,1)

# discretize the raw data into 25 ordinal bins found with k-means
kbins = KBinsDiscretizer(n_bins=25, encode='ordinal', strategy='kmeans')
data_trans = kbins.fit_transform(data)

# histogram of the transformed data
pyplot.hist(data_trans, bins=25)
pyplot.show()

# 3. Techniques to Handle Outliers


# 3.1 Z-Score or Extreme Value Analysis (parametric)

The z-score, or standard score, of an observation is a metric that indicates how many standard deviations a data point lies from the sample’s mean, assuming a Gaussian distribution. This makes the z-score a parametric method. Very frequently data are not well described by a Gaussian distribution; this can often be mitigated by transforming the data first (for example with a log or power transform). Note that plain rescaling does not change the shape of a distribution.
Python libraries such as SciPy and scikit-learn have easy-to-use functions and classes for this, and they work well alongside Pandas and NumPy.
After making the appropriate transformations to the selected feature space of the dataset, the z-score of any data point can be calculated with the following expression:

![](https://miro.medium.com/max/170/0*TwXvmgI5j7ArPPq4.)

When computing the z-score for each sample in the data set, a threshold must be specified. Some good rule-of-thumb thresholds are 2.5, 3, 3.5 or more standard deviations.

![](https://miro.medium.com/max/679/0*y7kVHEQPQKBg3Cga.)

By ‘tagging’ or removing the data points that lie beyond a given threshold, we classify the data into outliers and non-outliers.

The z-score is a simple yet powerful method to get rid of outliers in data if you are dealing with parametric distributions in a low-dimensional feature space. For nonparametric problems, DBSCAN and Isolation Forests can be good solutions.
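
A minimal sketch of the z-score calculation on made-up numbers, written out by hand as z = (x - mean) / standard deviation and compared with `scipy.stats.zscore`. The sample values and the threshold of 3 are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Made-up sample: 30 ordinary values and one extreme value
values = np.array([10.0, 11.0, 12.0, 9.0, 10.5] * 6 + [40.0])

# z = (x - mean) / standard deviation
z_manual = (values - values.mean()) / values.std()

# scipy.stats.zscore uses the population standard deviation (ddof=0) by default,
# which matches NumPy's default std
z_scipy = stats.zscore(values)

threshold = 3
outliers = values[np.abs(z_manual) > threshold]
print(outliers)
```

Keep in mind that the extreme value inflates the mean and standard deviation it is measured against, so in very small samples a z-score can never reach a high threshold.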

In [None]:
df.shape

In [None]:
from scipy import stats
import pandas as pd
df = pd.read_csv('../input/placement-data-full-class/Placement_data_full_class.csv')
df=df.dropna()
df=df.reset_index(drop=True)#reindex from 0 without hard-coding the number of rows
d=pd.DataFrame(stats.zscore(df['salary']),columns=['z_score'])
d=d[(d['z_score']>3) | (d['z_score']<-3)]
d.head()

In [None]:
df.shape[0]

In [None]:
# collect salary and degree_p for the rows flagged by the z-score filter
salary=df.loc[d.index,'salary'].tolist()
degree=df.loc[d.index,'degree_p'].tolist()

In [None]:
print(salary,degree)

In [None]:
import matplotlib.pyplot as plt
plt.scatter(df['degree_p'],df['salary'])
plt.scatter(degree,salary)
plt.show()

# 3.2 Isolation Forest

This is a non-parametric method for large datasets in a one- or multi-dimensional feature space.

An important concept in this method is the isolation number.
Isolation forests are an effective method for detecting outliers or novelties in data. It is a relatively novel method based on binary decision trees. Scikit-learn’s implementation is relatively simple and easy to understand.
Isolation forest’s basic principle is that outliers are few and far from the rest of the observations. To build a tree (training), the algorithm randomly picks a feature from the feature space and a random split value between that feature’s minimum and maximum. This is done for all the observations in the training set. To build the forest, an ensemble of such trees is created and their results are averaged.
Then, for prediction, it compares an observation against the splitting value in a “node”; that node has two child nodes on which further random comparisons are made. The number of “splittings” made by the algorithm for an instance is called the “path length”. As expected, outliers will have shorter path lengths than the rest of the observations.
An outlier score can be computed for each observation from its average path length across the trees.

The isolation number is the number of splits needed to isolate a data point. This number of splits is ascertained by following these steps:

1. A point “a” to isolate is selected randomly.

2. A random data point “b” is selected that is between the minimum and maximum value and different from “a”.

3. If the value of “b” is lower than the value of “a”, the value of “b” becomes the new lower limit.

4. If the value of “b” is greater than the value of “a”, the value of “b” becomes the new upper limit.

This procedure is repeated as long as there are data points other than “a” between the upper and the lower limit.

It requires fewer splits to isolate an outlier than it does to isolate a non-outlier, i.e. an outlier has a lower isolation number in comparison to a non-outlier point. A data point is therefore defined as an outlier if its isolation number is lower than the threshold.

The threshold is defined based on the estimated percentage of outliers in the data, which is the starting point of this outlier detection algorithm.
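
The isolation number procedure above can be sketched for a single feature as follows. This is a simplified illustration, not scikit-learn's implementation: the function name `isolation_number`, the synthetic data, and the choice to draw the split value "b" uniformly between the current limits are all assumptions.

```python
import numpy as np

def isolation_number(values, target, rng):
    """Number of random splits needed until `target` is alone between the limits (1-D)."""
    lower, upper = float(np.min(values)), float(np.max(values))
    splits = 0
    # repeat while other data points remain between the lower and upper limits
    while np.any((values >= lower) & (values <= upper) & (values != target)):
        b = rng.uniform(lower, upper)   # random split value between the current limits
        if b < target:
            lower = b                   # b becomes the new lower limit
        elif b > target:
            upper = b                   # b becomes the new upper limit
        else:
            continue                    # b must differ from the target
        splits += 1
    return splits

rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=0.0, scale=1.0, size=100), 10.0)  # 10.0 is a planted outlier
inlier = values[0]

# Average over many runs, because a single run is random
avg_outlier = np.mean([isolation_number(values, 10.0, rng) for _ in range(200)])
avg_inlier = np.mean([isolation_number(values, inlier, rng) for _ in range(200)])
print("average isolation number, outlier:", avg_outlier)
print("average isolation number, inlier: ", avg_inlier)
```

Averaged over many runs, the planted outlier needs fewer splits than a point in the middle of the data, which is the property the threshold is applied to.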

An explanation with images of the isolation forest technique is available at https://quantdare.com/isolation-forest-algorithm/.



In [None]:
from sklearn.ensemble import IsolationForest
df = pd.read_csv('../input/placement-data-full-class/Placement_data_full_class.csv')
df=df.dropna()
df=df.reset_index(drop=True)#reindexing
# random_state makes the random splits reproducible
model=IsolationForest(n_estimators=50, max_samples='auto', contamination=0.05, max_features=1.0, random_state=42)
# Isolation Forest is unsupervised: it is fitted on the feature only, no target is needed
model.fit(df[['ssc_p']])

dg=pd.DataFrame({'ssc_p':df['ssc_p'],
                 'score':model.decision_function(df[['ssc_p']]),
                 'anomaly':model.predict(df[['ssc_p']]),
                 'hsc_p':df['hsc_p']})
# keep only the rows predicted as anomalies (label -1)
dg2=dg[dg['anomaly']==-1]
dg2

**Plotting the outliers over the data**

In [None]:

plt.scatter(dg['ssc_p'],dg['hsc_p'])
plt.scatter(dg2['ssc_p'],dg2['hsc_p'])
plt.show()

**Print the anomalies**

In [None]:
anomaly=dg.loc[dg['anomaly']==-1]
anomaly_index=list(anomaly.index)
print(anomaly)

# 3.3 DBSCAN Anomaly Detection
This is a clustering algorithm (an alternative to K-Means) that clusters points together and identifies any points not belonging to a cluster as outliers. It’s like K-means, except the number of clusters does not need to be specified in advance.

The method, step-by-step:
1. Randomly select a point not already assigned to a cluster or designated as an outlier. Determine if it’s a core point by seeing if there are at least min_samples points around it within epsilon distance.
1. Create a cluster of this core point and all points within epsilon distance of it (all directly reachable points).
1. Find all points that are within epsilon distance of each point in the cluster and add them to the cluster. Find all points that are within epsilon distance of all newly added points and add these to the cluster. Rinse and repeat. (i.e. perform “neighborhood jumps” to find all density-reachable points and add them to the cluster).

Lingo underlying the above:
1. Any point that has at least min_samples points within epsilon distance of it will form a cluster. This point is called a core point. The core point will itself count towards the min_samples requirement.
2. Any point that is within epsilon distance of a core point, but does not itself have min_samples points within epsilon distance, is called a border point and does not form its own cluster.
3. A border point that is within epsilon distance of multiple core points (multiple epsilon balls) will arbitrarily end up in just one of these resultant clusters.
4. Any point that is neither a core point nor a border point is called a noise point or outlier and is not assigned to any cluster. Thus, it does not have at least min_samples points within epsilon distance of it, and it is not within epsilon distance of a core point.
5. The epsilon-neighborhood of point p is all points within epsilon distance of p, which are said to be directly reachable from p.
6. A point contained in the neighborhood of a point directly reachable from p is not necessarily directly reachable from p, but is density-reachable.
7. Any point that can be reached by jumping from neighborhood to neighborhood from the original core point is density-reachable.
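
The three kinds of points can be read off a fitted scikit-learn `DBSCAN` model, as in the sketch below. The synthetic data and the values of `eps` and `min_samples` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=0.3, size=(100, 2))   # one dense synthetic cluster
far_point = np.array([[5.0, 5.0]])                        # a planted isolated point
X = np.vstack([cluster, far_point])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)

# core points: listed in core_sample_indices_
is_core = np.zeros(len(X), dtype=bool)
is_core[db.core_sample_indices_] = True
# noise points (outliers): labelled -1
is_noise = db.labels_ == -1
# border points: assigned to a cluster without being core points
is_border = ~is_core & ~is_noise

print("core:", is_core.sum(), "border:", is_border.sum(), "noise:", is_noise.sum())
```

Every point falls into exactly one of the three groups, and the isolated point ends up as noise because nothing lies within epsilon distance of it.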

Implementation Considerations:
1. You may need to standardize / scale / normalize your data first.
2. Be mindful of data type and the distance measure. I’ve read that the gower distance metric can be used for mixed data types. I’ve implemented Euclidean, here, which needs continuous variables, so I removed gender.
3. You will want to optimize epsilon and min_samples.
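
One common heuristic for choosing epsilon (a sketch under assumptions, not the only way to tune it) is to look at the sorted distance from each point to its k-th nearest neighbour and pick a value near the "elbow" of that curve. The synthetic data and `min_samples = 5` below are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))          # stand-in for your scaled features

min_samples = 5
# kneighbors() without arguments excludes each point itself, so asking for
# min_samples - 1 neighbours matches DBSCAN's rule that the point itself counts too
nn = NearestNeighbors(n_neighbors=min_samples - 1).fit(X)
distances, _ = nn.kneighbors()
k_distances = np.sort(distances[:, -1])

# Plotting k_distances (for example with plt.plot) and reading off the elbow
# gives a candidate eps; here only a few quantiles are printed
print(np.quantile(k_distances, [0.5, 0.9, 0.95, 0.99]))
```

Points whose k-distance sits far above the elbow are the ones DBSCAN would label as noise for that choice of epsilon.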

In [None]:

df = pd.read_csv('../input/placement-data-full-class/Placement_data_full_class.csv')
df=df.dropna()[['degree_p','salary']]
df=df.reset_index(drop=True)#reindex from 0 without hard-coding the number of rows


#before DBSCAN you must scale your dataset
stscaler = StandardScaler().fit(df)
df = pd.DataFrame(stscaler.transform(df))
print(df)
df.describe()

In [None]:

dbsc = DBSCAN(eps = 1.3, min_samples = 25).fit(df)
labels = dbsc.labels_
print(Counter(labels))


In [None]:
outliers=df[dbsc.labels_==-1]
outliers

In [None]:
df.head()

In [None]:
plt.scatter(df[0],df[1])
plt.scatter(outliers[0],outliers[1])
plt.xlabel("Degree_p")
plt.ylabel("salary")
plt.show()

# 3.4 Elliptic Envelope
The Elliptic Envelope is built on the premise that the data come from a known distribution. If we draw an ellipse around the Gaussian distribution of the data, anything that lies outside the ellipse is considered an outlier.
The assumption in the Elliptic Envelope is that normal data points are Gaussian distributed.
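
To make the "ellipse" concrete, the sketch below fits scikit-learn's `EllipticEnvelope` to a synthetic Gaussian cloud with one planted far-away point (the data and the `contamination` value are assumptions for illustration). The squared Mahalanobis distance to the fitted centre is the quantity that decides whether a point lies inside or outside the ellipse.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
inliers = rng.normal(size=(200, 2))            # synthetic Gaussian cloud
X = np.vstack([inliers, [[8.0, 8.0]]])         # plus one planted far-away point

clf = EllipticEnvelope(contamination=0.01, random_state=0).fit(X)
pred = clf.predict(X)                          # +1 = inside the ellipse, -1 = outside
d2 = clf.mahalanobis(X)                        # squared Mahalanobis distance to the fitted centre

print("flagged rows:", np.where(pred == -1)[0])
print("row with the largest distance:", int(np.argmax(d2)))
```

The `contamination` parameter fixes the share of points that end up outside the ellipse, so it should reflect how many outliers you expect.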

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.covariance import EllipticEnvelope
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
%matplotlib inline

In [None]:
from sklearn.covariance import EllipticEnvelope
import pandas as pd
df = pd.read_csv('../input/placement-data-full-class/Placement_data_full_class.csv')
df=df.dropna()[['degree_p','salary']]
df=df.reset_index(drop=True)#reindex from 0 without hard-coding the number of rows

#before E.E. you must scale your dataset
stscaler = StandardScaler().fit(df)
df = pd.DataFrame(stscaler.transform(df))
clf = EllipticEnvelope(contamination=0.02,random_state=100)
clf.fit(df)
ee_scores = pd.Series(clf.decision_function(df)) 
ee_predict = clf.predict(df)
print(Counter(ee_predict))
outliers=df[ee_predict==-1]
outliers

In [None]:
plt.scatter(df[0],df[1])
plt.scatter(outliers[0],outliers[1])
plt.xlabel("Degree_p")
plt.ylabel("Salary")
plt.show()

# 3.5. One-Class Support Vector Machines
The OneClassSVM is known to be sensitive to outliers and thus does not perform very well for outlier detection. This estimator is best suited for novelty detection when the training set is not contaminated by outliers. That said, outlier detection in high-dimension, or without any assumptions on the distribution of the inlying data is very challenging, and a One-class SVM might give useful results in these situations depending on the value of its hyperparameters.
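
The novelty-detection use case can be sketched as follows: fit on clean data only, then score new observations. The synthetic training data, the test points and the hyperparameters are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))            # clean training data, no planted outliers

# nu is roughly the fraction of training points allowed to fall outside the boundary
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)

X_new = np.array([[0.0, 0.0],                  # near the centre of the training data
                  [6.0, 6.0]])                 # far from anything seen in training
new_pred = clf.predict(X_new)                  # +1 = looks like training data, -1 = novelty
print(new_pred)
```

A point far from everything seen in training is expected to be labelled -1. How points closer to the training data are labelled depends strongly on `nu` and `gamma`.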

In [None]:
from sklearn import svm
import pandas as pd
df = pd.read_csv('../input/placement-data-full-class/Placement_data_full_class.csv')
df=df.dropna()[['degree_p','salary']]
df=df.reset_index(drop=True)#reindex from 0 without hard-coding the number of rows
#before this you must scale your dataset
stscaler = StandardScaler().fit(df)
df = pd.DataFrame(stscaler.transform(df))
clf=svm.OneClassSVM(nu=0.05,kernel='rbf',gamma=.01)
clf.fit(df)
predict=clf.predict(df)
outliers=df[predict==-1]
print(Counter(predict))
outliers


In [None]:
plt.scatter(df[0],df[1])
plt.scatter(outliers[0],outliers[1])
plt.xlabel("Degree_p")
plt.ylabel("Salary")
plt.show()

# 3.6 Local Outlier Factor
LOF uses density-based outlier detection to identify local outliers, points that are outliers with respect to their local neighborhood, rather than with respect to the global data distribution. The higher the LOF value for an observation, the more anomalous the observation.
This is useful because not all methods will identify a point that’s an outlier relative to a nearby cluster of points (a local outlier) if that whole region is not an outlying region in the global space of data points.
A point is labeled as an outlier if the density around that point is significantly different from the density around its neighbors.
In the below feature space, LOF is able to identify P1 and P2 as outliers, which are local outliers to Cluster 2 (in addition to P3).


**The method, step-by-step:**

1. For each point P, do the following:
1. Calculate the distances between P and every other point (for example the Manhattan distance, dist(p1,p2) = |x1-x2| + |y1-y2|).
1. Find the Kth closest point (the Kth nearest neighbor’s distance is K-Dist(P)).
1. Find the K closest points (those whose distances are no larger than K-Dist(P)), the K-distance neighborhood of P, Nk(P).
1. Find its density (Local Reachability Density, LRDk(P)), a measure of how close its neighbors are to it: basically the inverse of the average reachability distance between P and its neighbors. The lower the density, the farther P is from its neighbors.
1. Find its local outlier factor, LOFk(P) = sum(reachability distances of neighbors to P) x sum(densities of neighbors), normalised by the squared number of neighbors (so it is the product of the two averages). LOFk(P) is basically the average distance between P and its neighboring points, weighted by the average of those points’ densities (how close they are to their own K neighboring points).


**Additional Details**

1. For step 2, if two points have the same distance to P, just select one as the next closest and the other as the one after that.
1. For step 4, LRD = Local Reachability Density = inverse(avg reachability distance between P and its neighbors). The word reachability is used because if P is closer to a neighbor than that neighbor’s own Kth nearest neighbor, the neighbor’s K-distance is used instead of the actual distance, as a means of smoothing.
1. For step 4, the reachability distance of P to each of its K neighbors is reachdistk(n1<-p) = max(distk(n1), dist(n1,p)), where distk(n1) is the K-distance of neighbor n1.

1. For step 4, the total of the reachability distances is divided by the number of neighboring points (||Nk(P)||), computed using the results of step 3.
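
The steps above can be sketched in NumPy and checked against scikit-learn. This is an illustrative sketch that assumes Euclidean distance (scikit-learn's default) rather than Manhattan distance; the function name `lof_scores` and the synthetic data are made up.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

def lof_scores(X, k):
    """LOF of every row of X, using Euclidean distance and k neighbours."""
    # kneighbors() without arguments excludes each point from its own neighbour list
    dist, idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors()
    k_dist = dist[:, -1]                       # K-Dist(P): distance to the k-th neighbour
    # reachability distance of P to neighbour n: max(K-Dist(n), dist(n, P))
    reach = np.maximum(dist, k_dist[idx])
    lrd = 1.0 / reach.mean(axis=1)             # local reachability density
    # LOF = average density of the neighbours divided by the point's own density
    return lrd[idx].mean(axis=1) / lrd

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 2)), [[6.0, 6.0]]])   # last row is a planted outlier

manual = lof_scores(X, k=10)
sk = LocalOutlierFactor(n_neighbors=10).fit(X)
print("largest manual LOF at row:", int(np.argmax(manual)))
print("matches scikit-learn:", np.allclose(manual, -sk.negative_outlier_factor_))
```

Dividing the neighbours' average density by the point's own density is the same as multiplying it by the point's average reachability distance, which is the product described in the steps. scikit-learn stores the negated value in `negative_outlier_factor_`, hence the sign flip in the comparison.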


**Scenarios affecting LOF values:**
Higher LOF values indicate a greater anomaly level. Recall that

LOFk(p) = sum(reachability distances of its neighbors to P) x sum(neighbor densities), normalised by the squared number of neighbors.

The LOF for a point P will have a:

1. High value if -> P is far from its neighbors and its neighbors have high densities (are close to their neighbors) (LOF = (high distance sum) x (high density sum) = high value)
1. Middle value if -> P is far from its neighbors, but its neighbors have low densities (LOF = (high sum) x (low sum) = middle value)
1. Low value if -> P is close to its neighbors and its neighbors have low densities (LOF = (low sum) x (low sum) = low value)

Adjusting K:

1. Increase K too much and you’re just looking for outliers with respect to the entire dataset, so points far away from the highest density regions could be misclassified as outliers, even though they themselves reside in a cluster of points.
1. Reduce K too much and you’re looking for outliers with respect to very small local regions of points. This could also lead to points being misclassified as outliers.

In [None]:
from sklearn.neighbors import LocalOutlierFactor
df = pd.read_csv('../input/placement-data-full-class/Placement_data_full_class.csv')
df=df.dropna()[['degree_p','salary']]
df=df.reset_index(drop=True)#reindex from 0 without hard-coding the number of rows

stscaler = StandardScaler().fit(df)
df = pd.DataFrame(stscaler.transform(df))
clf = LocalOutlierFactor(n_neighbors=25, contamination=.09)
y_pred = clf.fit_predict(df)
LOF_Scores = clf.negative_outlier_factor_
LOF_pred=pd.Series(y_pred)
outliers=df[LOF_pred==-1]
print(Counter(LOF_pred))
outliers

In [None]:
plt.scatter(df[0],df[1])
plt.scatter(outliers[0],outliers[1])
plt.xlabel("Degree_p")
plt.ylabel("Salary")
plt.show()

# 3.7 Interquartile Range(IQR)
IQR-based detection finds the data points that lie outside the bulk of the distribution.
Any data point that falls more than 1.5 times the interquartile range below the 1st quartile or above the 3rd quartile is considered an outlier.

![](https://miro.medium.com/max/875/1*FUNw6hD0X-L-WkilD6u3Qg.png)

In [None]:
import numpy as np
import pandas as pd
df = pd.read_csv('../input/placement-data-full-class/Placement_data_full_class.csv')

df=df.dropna()[['degree_p','salary']]

df=df.reset_index(drop=True)#reindex from 0 without hard-coding the number of rows

stscaler = StandardScaler().fit(df)
df = pd.DataFrame(stscaler.transform(df))
df

In [None]:

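q1, q3= np.percentile(df[1],[25,75])
iqr=q3-q1
#Lower and upper bounds for the outliers, using the 1.5 x IQR rule described above;
#a larger multiplier (e.g. 2.5 or 3) flags only more extreme points
lower_bound = q1 -(1.5 * iqr) 
upper_bound = q3 +(1.5 * iqr)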


In [None]:
outliers=df[(df[1]>upper_bound) | (df[1]<lower_bound)]
outliers

In [None]:
plt.scatter(df[0],df[1])
plt.scatter(outliers[0],outliers[1])
plt.xlabel("Degree_p")
plt.ylabel("Salary")
plt.show()