# Outliers and Their Impact on Machine Learning

### Which machine learning models are sensitive to outliers?

1. Naive Bayes classifier: not sensitive to outliers
2. SVM: not sensitive to outliers
3. Linear regression: sensitive to outliers
4. Logistic regression: sensitive to outliers
5. Decision tree (regressor or classifier): not sensitive to outliers
6. Ensembles (Random Forest, XGBoost, Gradient Boosting): not sensitive to outliers
7. KNN: not sensitive to outliers
8. K-means: sensitive to outliers
9. Hierarchical clustering: sensitive to outliers
10. PCA: sensitive to outliers
11. Neural networks: sensitive to outliers
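
To make "sensitive" concrete, here is a minimal sketch of how a single outlier can pull a least-squares line fit. The data is synthetic and invented for illustration (it is not the Titanic data used below), and `np.polyfit` stands in for a linear regression model.

```python
import numpy as np

# Ten synthetic points lying exactly on the line y = 2x + 1.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

# Same data, but the last target value is corrupted to act as an outlier.
y_outlier = y.copy()
y_outlier[-1] = 100.0

# Degree-1 least-squares fit; polyfit returns [slope, intercept].
slope_clean, intercept_clean = np.polyfit(x, y, 1)
slope_outlier, intercept_outlier = np.polyfit(x, y_outlier, 1)

print(f"slope without outlier:  {slope_clean:.2f}")
print(f"slope with one outlier: {slope_outlier:.2f}")
```

One corrupted point out of ten is enough to move the fitted slope well away from 2, which is the kind of behaviour the "sensitive" entries in the list refer to.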

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('../input/titanic-machine-learning-from-disaster/train.csv')
df.head()

In [None]:
df['Age'].isnull().sum()

In [None]:
import seaborn as sns

In [None]:
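# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable plot
sns.histplot(df['Age'].dropna(), kde=True)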

In [None]:
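# Filling missing ages with an extreme value (100) creates an artificial spike far from the rest of the data
sns.histplot(df['Age'].fillna(100), kde=True)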

### Gaussian-distributed feature: Age

In [None]:
figure = df.Age.hist(bins=50)
figure.set_title('Age')
figure.set_xlabel('Age')
figure.set_ylabel('Number of passengers')

In [None]:
figure = df.boxplot(column='Age')

In [None]:
df['Age'].describe()

### If the data is normally distributed, use mean ± 3 standard deviations

In [None]:
# Assuming Age follows a Gaussian distribution, compute the boundaries beyond which values count as outliers
upper_boundary = df['Age'].mean() + 3 * df['Age'].std()
lower_boundary = df['Age'].mean() - 3 * df['Age'].std()
print(lower_boundary)
print(upper_boundary)
print(df['Age'].mean())
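
The same rule can be wrapped in a small helper that also flags the rows outside the boundaries. This is only a sketch: `three_sigma_bounds` is a hypothetical name, and the series below is synthetic (drawn from a normal distribution with one planted extreme value), not the Titanic `Age` column.

```python
import numpy as np
import pandas as pd

def three_sigma_bounds(series: pd.Series, k: float = 3.0):
    """Return (lower, upper) = mean -/+ k * std. NaN values are skipped by pandas."""
    mean = series.mean()
    std = series.std()  # sample standard deviation (ddof=1), as in the cell above
    return mean - k * std, mean + k * std

# Synthetic, roughly Gaussian "ages" with one planted extreme value.
rng = np.random.default_rng(0)
ages = pd.Series(rng.normal(loc=30, scale=10, size=500))
ages.iloc[0] = 150.0

lower, upper = three_sigma_bounds(ages)
outliers = ages[(ages < lower) | (ages > upper)]
print(f"lower={lower:.2f}, upper={upper:.2f}, flagged={len(outliers)}")
```

On the notebook's data the equivalent call would be `three_sigma_bounds(df['Age'])`.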

### If the feature is skewed, use the interquartile range (IQR)

In [None]:
figure = df.Fare.hist(bins=50)
figure.set_title('Fare')
figure.set_xlabel('Fare')
figure.set_ylabel('Number of passengers')

In [None]:
df.boxplot(column='Fare')

In [None]:
df['Fare'].describe()

In [None]:
# Compute the interquartile range (Q3 - Q1), which is used to derive the boundaries
IQR = df.Fare.quantile(0.75) - df.Fare.quantile(0.25)

In [None]:
# Standard outlier boundaries: 1.5 * IQR below Q1 and above Q3
lower_bridge = df['Fare'].quantile(0.25) - (IQR * 1.5)
upper_bridge = df['Fare'].quantile(0.75) + (IQR * 1.5)
print(lower_bridge)
print(upper_bridge)

In [None]:
# Extreme outlier boundaries: 3 * IQR below Q1 and above Q3
lower_bridge = df['Fare'].quantile(0.25) - (IQR * 3)
upper_bridge = df['Fare'].quantile(0.75) + (IQR * 3)
print(lower_bridge)
print(upper_bridge)
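
Both the 1.5 and the 3 multiplier follow the same formula, so they can share one helper. The sketch below assumes a hypothetical function name, `iqr_bounds`, and uses a small made-up fare list rather than the Titanic `Fare` column.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return (Q1 - k * IQR, Q3 + k * IQR) for the given series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up, right-skewed fares with one very large value.
fares = pd.Series([5.0, 7.0, 8.0, 10.0, 12.0, 15.0, 20.0, 30.0, 500.0])

mild = iqr_bounds(fares, k=1.5)     # usual outlier boundaries
extreme = iqr_bounds(fares, k=3.0)  # extreme outlier boundaries
print("k=1.5:", mild)
print("k=3.0:", extreme)
print("beyond extreme upper bound:", fares[fares > extreme[1]].tolist())
```

On the notebook's data, `iqr_bounds(df['Fare'], k=3.0)` should reproduce the boundaries computed in the cell above.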

In [None]:
data = df.copy()

In [None]:
# Cap Age at 73, the rounded upper 3-standard-deviation boundary computed earlier
data.loc[data['Age'] >= 73, 'Age'] = 73

In [None]:
# Cap Fare at 100, the rounded upper extreme-outlier (3 * IQR) boundary computed earlier
data.loc[data['Fare'] >= 100, 'Fare'] = 100
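
Capping values at a boundary can also be written with `Series.clip`, which leaves missing values untouched. A minimal sketch on a tiny invented frame (the column names mirror the notebook, the values are made up):

```python
import pandas as pd

# Invented toy data with one missing age and one extreme value per column.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None, 80.0],
    "Fare": [7.25, 71.28, 8.05, 512.33],
})

capped = toy.copy()  # keep the original frame unchanged
capped["Age"] = capped["Age"].clip(upper=73)     # values above 73 become 73
capped["Fare"] = capped["Fare"].clip(upper=100)  # values above 100 become 100
print(capped)
```

Values already below the cap are unchanged, so the result matches the `.loc` assignment used above.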

In [None]:
figure=data.Age.hist(bins=50)
figure.set_title('Age')
figure.set_xlabel('Age')
figure.set_ylabel('Number of passengers')

In [None]:
figure=data.Fare.hist(bins=50)
figure.set_title('Fare')
figure.set_xlabel('Fare')
figure.set_ylabel('Number of passengers')

In [None]:
from sklearn.model_selection import train_test_split

# Use the capped Age and Fare as features; missing ages are filled with 0 here
X_train, X_test, y_train, y_test = train_test_split(
    data[['Age', 'Fare']].fillna(0), data['Survived'], test_size=0.3)

In [None]:
# Applying Logistic Regression
from sklearn.linear_model import LogisticRegression
classifier=LogisticRegression()
classifier.fit(X_train,y_train)
y_pred=classifier.predict(X_test)
y_pred1=classifier.predict_proba(X_test)

from sklearn.metrics import accuracy_score,roc_auc_score
print("Accuracy_score: {}".format(accuracy_score(y_test,y_pred)))
print("roc_auc_score: {}".format(roc_auc_score(y_test,y_pred1[:,1])))

In [None]:
# Applying Random Forest
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier()
classifier.fit(X_train,y_train)
y_pred=classifier.predict(X_test)
y_pred1=classifier.predict_proba(X_test)

from sklearn.metrics import accuracy_score,roc_auc_score
print("Accuracy_score: {}".format(accuracy_score(y_test,y_pred)))
print("roc_auc_score: {}".format(roc_auc_score(y_test,y_pred1[:,1])))
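
The two cells above repeat the same fit, predict, and score steps, so they could share one helper. The sketch below is one way to do that: `evaluate` is a hypothetical name, the data comes from `make_classification` rather than the Titanic file, and `random_state` is fixed only so that reruns give the same split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit the model and return (accuracy, ROC AUC) on the test split."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    return accuracy_score(y_test, y_pred), roc_auc_score(y_test, y_proba)

# Synthetic two-feature binary classification data (stand-in for Age and Fare).
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

models = [
    ("logistic_regression", LogisticRegression()),
    ("random_forest", RandomForestClassifier(random_state=0)),
]
results = {}
for name, model in models:
    results[name] = evaluate(model, X_train, X_test, y_train, y_test)
    print("{}: accuracy={:.3f}, roc_auc={:.3f}".format(name, *results[name]))
```

Running the same helper on the original `df` and on the capped `data` would be one way to compare how much each model reacts to the outlier treatment.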