### Unsupervised learning - prediction and visualization using K-means clustering

Unsupervised learning means fitting a model that draws inferences from a dataset without labelled responses. Here, that means grouping the training data into clusters and then using the fitted model to decide which cluster new test data falls into.

K-means clustering is a widely used technique for unsupervised learning. It splits the dataset into K clusters. K distinct data points are first chosen at random as the initial centroids, and every point is assigned to its nearest centroid, which gives the initial set of clusters. Each centroid is then moved to the mean of the points assigned to it, and the assign-and-update steps repeat for several iterations until the centroids stabilize.
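
The assign-and-update loop described above can be sketched in plain NumPy. This is only an illustration under simple assumptions: the function name `kmeans_sketch` and the toy points are made up, and the notebook itself uses scikit-learn's `KMeans` rather than this code.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Plain K-means on an array of shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids have stabilized
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs (made-up data, for illustration only)
toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_labels, toy_centroids = kmeans_sketch(toy, k=2)
print(toy_labels)
print(toy_centroids)
```

Which blob ends up as cluster 0 depends on the random start, so only the grouping is meaningful, not the cluster number.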

__________________________________________________________________________________________________________________________________

### Importing libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import LabelEncoder

### Reading the file

In [None]:
df=pd.read_csv('../input/titanic-machine-learning-from-disaster/train.csv')
df.head()

### Some Initial Analysis of dataset

In [None]:
print('Median fare of survived', df[df['Survived']==1]['Fare'].median())
print('Median fare of not survived', df[df['Survived']==0]['Fare'].median())
print('Number of Unique passenger ids=',len(df['PassengerId'].unique()))
print('Number of Unique Tickets=',len(df['Ticket'].unique()))
print('Number of Unique cabins=',len(df['Cabin'].unique())) ##Not a reliable number due to presence of missing information

A few observations:
1. Names follow an older format: Surname, Title, First name, Middle name.
2. Passenger id is unique for every passenger, whereas Ticket number and Cabin may be shared within a family.
3. We assume that people with the same last name are a family travelling together.
4. The median fare of those who survived is much higher than that of those who did not.
5. Sex and Embarked are categorical features and must be converted to numbers for the model.
6. The title can be extracted from the name, and it may tell us about the age and gender group of the individual.
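
As a small worked example of observations 1 and 6, the snippet below splits a single name written in the dataset's format into its parts. The variable names are illustrative only and are not used elsewhere in the notebook.

```python
# A name in the "Surname, Title. Given names" format used by the dataset
name = "Braund, Mr. Owen Harris"

# Everything before the first ", " is the surname
surname, rest = name.split(", ", 1)
# Everything before the first ". " in the remainder is the title
title, given_names = rest.split(". ", 1)

print(surname)
print(title)
print(given_names)
```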

## 1. Data Preprocessing

### Extracting Surnames for the purpose of identifying families

In [None]:
def clean_name(x):
    l=[]
    if isinstance(x,str):
        l=x.split(", ")
        x=l[0]
    return(x)

df['Surname'] = df['Name'].apply(clean_name).astype('str')

### Getting titles for the purpose of gender vs age bracket

In [None]:
def clean_title(x):
    l=[]
    t=[]
    if isinstance(x,str):
        l=x.split(", ")
        s=l[1]
        t=s.split(". ")
        x= t[0]
    return(x)

df['Title'] = df['Name'].apply(clean_title).astype('str')

In [None]:
df['Title'].value_counts()

There are too many distinct titles to work with. After exploring this part of the data, we can fold the rare titles into the 4 basic categories: Mr, Mrs, Miss and Master.

In [None]:
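def new_title(x,Sex,Age):
    a=''
    if isinstance(x,str):
        if x in ['Mr', 'Mrs', 'Miss', 'Master']:
            a=x
        elif Sex=='female':
            # A missing Age (NaN) compares as False, so it falls back to 'Mrs'
            a='Miss' if Age<30 else 'Mrs'
        else:
            # Only males with a known Age below 18 become 'Master';
            # a missing Age falls back to 'Mr' (assumed default)
            a='Master' if Age<18 else 'Mr'
    return(a)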

df['Title'] = df.apply(lambda x: new_title(x['Title'], x['Sex'],x['Age']), axis=1)

### Count how many passengers aboard share each surname

In [None]:
sur= df.groupby('Surname').count()['Title']
df['Fam_count']=df['Surname'].map(sur)

### Mark passengers travelling with family as 1 and individual passengers as 0

In [None]:
def isfam(x):
    if x>1:
        a=1
    else:
        a=0
    return(a)

df['IsFamily']=df['Fam_count'].apply(isfam)

### Encoding Title and Sex because they are categorical

In [None]:
df['new_title']=df['Title'].replace({'Mr':0,'Mrs':1,'Master':2,'Miss':3})
df['Sex']=df['Sex'].replace({'male':0,'female':1})

### Drop irrelevant columns

In [None]:
df1= df.drop(['PassengerId','Name','Ticket','Cabin','Title','Surname','Fam_count'],axis=1)
df1.head()

### Check for NaN values, because the model cannot be fitted on data with missing values

In [None]:
df1.isnull().sum()

### Null value imputation for features that show presence of NaN

In [None]:
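# Assign the result back instead of using inplace=True on a column selection,
# which may not update df1 in recent pandas versions
df1['Age']=df1['Age'].fillna(df1['Age'].median())
df1['Embarked']=df1['Embarked'].fillna(df1['Embarked'].mode()[0])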

In [None]:
df1.isnull().sum() #Recheck the data

### Transform the last categorical feature

In [None]:
df1['Embarked']=df1['Embarked'].replace({'C':0,'S':1,'Q':2})

### Final check of the dataset. This confirms that no null values remain, that the categorical features are encoded, and that the datatypes are as expected.

In [None]:
df1.info()

In [None]:
df1.head()

### The final pre-processing step is to bring all features to a common scale. This prevents features with larger numeric ranges from dominating the distance calculation and biasing the model.

In [None]:
x1=df1.iloc[:,[1,2,3,4,5,6,7,8,9]].values

In [None]:
from sklearn import preprocessing
X = preprocessing.scale(x1)
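
To show what the scaling step does, the sketch below standardizes two made-up columns by hand. `preprocessing.scale` performs the same standardization per column (zero mean, unit variance); the numbers here are invented and only illustrate features on very different scales.

```python
import numpy as np

# Made-up columns on very different scales (an age-like and a fare-like feature)
raw = np.array([[22.0, 7.25],
                [38.0, 71.28],
                [26.0, 7.92],
                [35.0, 53.10]])

# Subtract each column's mean and divide by its standard deviation
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(scaled.mean(axis=0))  # each column mean is (numerically) zero
print(scaled.std(axis=0))   # each column standard deviation is one
```

After scaling, a difference of one unit means "one standard deviation" in every column, so no single feature dominates the Euclidean distances that K-means relies on.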

## 2. Building the model

In [None]:
from sklearn.cluster import KMeans
y = np.array(df['Survived'])
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)

#### Outside of a binary case like this dataset, the number of clusters is usually not known in advance and has to be found first. The elbow method plots the within-cluster sum of squared distances against the number of clusters to help choose it.
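
A minimal sketch of the elbow method follows. It is not part of the original analysis: the data are three made-up blobs, and the range of `k` values is an arbitrary choice. It relies on the `inertia_` attribute of a fitted `KMeans`, which holds the within-cluster sum of squared distances.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Made-up data: three well-separated 2-D blobs centred at (0,0), (5,5), (10,10)
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
                  for c in (0.0, 5.0, 10.0)])

k_values = range(1, 7)
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    inertias.append(model.inertia_)  # within-cluster sum of squared distances

for k, value in zip(k_values, inertias):
    print(k, round(value, 1))
```

Plotting `inertias` against `k_values` (for example with `plt.plot`) should show a sharp drop up to k=3 and only small gains afterwards; that bend is the "elbow" that suggests the number of clusters.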

## 3. Checking the accuracy

In [None]:
# Predict on the scaled features X, the same data the model was fitted on
# (the unscaled x1 would be compared against centroids on a different scale)
labels = kmeans.predict(X)
correct = int((labels == y).sum())

# Cluster ids are arbitrary: a value well below 0.5 means the ids are flipped
# relative to the Survived labels
print(correct/len(X))

As a rough rule of thumb, an accuracy of 70% or more is considered acceptable for a first, untuned model.

## 4. Prediction of Survival based on the dataset

In [None]:
pred=kmeans.predict(X)

## 5. Plotting the clusters

In [None]:
plt.scatter(X[pred == 0, 0], X[pred == 0, 1], 
            s = 30, c = 'red', label = 'dead')
plt.scatter(X[pred == 1, 0], X[pred == 1, 1], 
            s = 30, c = 'blue', label = 'survived')


plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:,1], 
            s = 30, c = 'yellow', label = 'Centroids')

plt.legend()
plt.show()

## Cross check the prediction numbers with actual

In [None]:
# sort_index() orders the counts as [0, 1] so that they line up with b
a=df['Survived'].value_counts().sort_index().values
b=[len(pred[pred==0]),len(pred[pred==1])]
check=pd.DataFrame({'Actual':a,'Predicted':b},columns=['Actual','Predicted'])
check

In [None]:
print('Model accuracy is: %.2f'%((correct/len(X))*100),'%')

Note, however, that a clustering algorithm does not predict or label anything. It only places the data into clusters without saying which cluster is which, so cluster 0 need not correspond to "not survived". The accuracy could be estimated here only because this is a binary dataset with known survived and not-survived labels.
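
One common way to deal with arbitrary cluster numbers is to relabel each cluster with the most frequent true label among its members before computing accuracy. The sketch below illustrates this on made-up arrays; the helper `map_clusters_to_labels` is hypothetical and is not part of the notebook or of scikit-learn.

```python
import numpy as np

def map_clusters_to_labels(cluster_ids, true_labels):
    """Relabel each cluster with the most common true label among its members."""
    mapped = np.empty_like(true_labels)
    for cluster in np.unique(cluster_ids):
        members = cluster_ids == cluster
        mapped[members] = np.bincount(true_labels[members]).argmax()
    return mapped

# Made-up example in which the cluster ids are flipped relative to the labels
cluster_ids = np.array([1, 1, 1, 0, 0, 1])
true_labels = np.array([0, 0, 0, 1, 1, 1])

mapped = map_clusters_to_labels(cluster_ids, true_labels)
print((cluster_ids == true_labels).mean())  # accuracy with raw cluster ids
print((mapped == true_labels).mean())       # accuracy after relabelling
```

In the notebook, the same idea could be applied to `pred` and `y`, keeping in mind that this uses the true labels and is therefore only an evaluation aid, not part of the clustering itself.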

## THANK YOU