<a href="https://colab.research.google.com/github/Suchitra-V31/Machine-learning-projects/blob/main/Cervical_Cancer_Risk_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Cervical Cancer Risk Classification**



Cervical cancer is a type of cancer that occurs in the cells of the cervix, the lower part of the uterus that connects to the vagina.



In this notebook we will predict whether a patient has cervical cancer using several machine learning algorithms.

Let us first import all the necessary libraries.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


**Load the Data**

In [1]:
data=pd.read_csv('../input/cervical-cancer-risk-classification/kag_risk_factors_cervical_cancer.csv')

In [1]:
data.head()

In [1]:
data.tail()

In [1]:
data.shape

In [1]:
data.info()

In [1]:
data.describe()

Let us check whether our dataset has any NaN values.



In [1]:
data.isnull().sum()

We can see that our dataset doesn't have any NaN values.
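Keep in mind that `isnull()` only detects real NaN values. A placeholder string is treated as ordinary data, so it may be worth counting it explicitly. The following is a minimal sketch on a small made-up frame; the `toy` data and its column values are illustrative assumptions, not the real dataset.

```python
import pandas as pd

# Hypothetical toy data: '?' stands in for a missing value
toy = pd.DataFrame({
    'Age': [18, 25, 34],
    'Smokes': ['0.0', '?', '1.0'],
    'Num of pregnancies': ['?', '?', '2.0'],
})

# isnull() reports no missing values because '?' is an ordinary string
nan_counts = toy.isnull().sum()

# Comparing against '?' reveals the hidden missing entries per column
placeholder_counts = (toy == '?').sum()

print(nan_counts)
print(placeholder_counts)
```

Running the same comparison on `data` would show which columns need imputation.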

However, the dataset marks missing entries with '?'. So let us replace every '?' with 0 as a temporary placeholder and then replace that 0 with the median of its column.

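In [1]:
# '?' marks a missing value: turn it into NaN and parse every column as a number
data=data.replace('?',np.nan).apply(pd.to_numeric)
missing=data.isna()   # remember which cells held '?'
data=data.fillna(0)   # temporary placeholder

In [1]:
# Swap each placeholder 0 for the median of its column (genuine zeros are left alone)
for feature in data.columns:
    data[feature]=data[feature].mask(missing[feature],data[feature].median())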

In [1]:
data.head()

Now we can see that every '?' has been replaced with the median value of its column.
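A possible alternative to the two-step replacement is to fill the gaps with the median directly. This computes the median from the observed values only, without the temporary zeros. The sketch below uses a made-up `toy` frame (hypothetical data and column names), so it illustrates the idea rather than reproducing this notebook's exact numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with '?' as the missing-value marker
toy = pd.DataFrame({
    'Age': [18, 25, 34, 40],
    'Smokes (years)': ['0.0', '?', '4.0', '10.0'],
})

# Turn '?' into NaN and parse every column as a number
numeric = toy.replace('?', np.nan).apply(pd.to_numeric)

# Fill each column's NaN with that column's median (median() skips NaN)
imputed = numeric.fillna(numeric.median())

print(imputed)
```

Here the median of the observed values 0, 4 and 10 is 4, so the missing entry becomes 4.0 while the genuine 0.0 stays as it is.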

**Exploratory Data Analysis**

Let us visualize the data and try to understand the impact of the features on the dependent variable.

In [1]:
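# distplot is deprecated in recent seaborn releases; histplot with a KDE curve gives a similar picture
sns.histplot(data['Age'],kde=True)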

From this graph we can see that most of the patients in the dataset are between 20 and 40 years old.

Let us create a dataframe df that contains the results of all four tests.

In [1]:
df=pd.DataFrame(data[['Hinselmann','Schiller','Citology','Biopsy']])
df.head(10)

Let us plot how many patients tested negative (0) and positive (1) in each of the four tests.

In [1]:
for features in df.columns:
    s=df.copy()
    sns.countplot(x=s[features])
    plt.xlabel(features)
    plt.title(features)
    plt.show()

Let us compare the average age of patients for each outcome of the four tests.

In [1]:
for features in df.columns:
    s=df.copy()
    sns.barplot(x=s[features],y=data['Age'])
    plt.xlabel(features)
    plt.ylabel("Age")
    plt.title(features)
    plt.show()
    

Now let us visualize the correlation of the df dataframe which contains the target values.

In [1]:
sns.heatmap(df.corr(),annot=True)

Let us find the correlation of the whole data.

In [1]:
sns.heatmap(data.corr())

We can see that our target is spread across four columns. To combine them into one, we are going to add the outcomes of all four tests and store the sum in a separate column called 'count'.

In [1]:
df['count']=df['Hinselmann']+df['Schiller']+df['Citology']+df['Biopsy']

In [1]:
df['count'].value_counts()

To make this a binary classification problem, we are going to replace 1, 2, 3 and 4 with 1, which means the patient has cancer, while 0 means the patient doesn't have cancer.

In [1]:
df['result']=np.where(df['count']>0,1,df['count'])

In [1]:
df['result'].value_counts()
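The sum-then-threshold step can be checked on a tiny example. The sketch below uses four made-up patients (the `tests` frame is hypothetical) and also shows an equivalent shortcut that asks whether any test is positive.

```python
import numpy as np
import pandas as pd

# Hypothetical outcomes of the four tests for four patients
tests = pd.DataFrame({
    'Hinselmann': [0, 1, 0, 1],
    'Schiller':   [0, 1, 0, 0],
    'Citology':   [0, 0, 0, 1],
    'Biopsy':     [0, 1, 0, 1],
})

# Sum the four outcomes, then map any positive count to 1
count = tests.sum(axis=1)
result_from_count = np.where(count > 0, 1, 0)

# Equivalent shortcut: 1 if any test is positive, else 0
result_from_any = (tests > 0).any(axis=1).astype(int)

print(count.tolist())            # [0, 3, 0, 3]
print(result_from_any.tolist())  # [0, 1, 0, 1]
```

Both routes give the same labels; the sum is only needed if the number of positive tests is of interest in its own right.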

Let us split our data into independent and dependent features.

In [1]:
# Drop the four test columns; they are used to build the target, not as features
X=data.drop(columns=['Hinselmann','Schiller','Citology','Biopsy'])
y=df['result']

Standardize our features using StandardScaler.

In [1]:
from sklearn.preprocessing import StandardScaler

In [1]:
scaler=StandardScaler()

In [1]:
scaled_feature=scaler.fit_transform(X)

In [1]:
X_scaled=scaled_feature
y=df['result']
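For reference, StandardScaler subtracts each column's mean and divides by its standard deviation, so every feature ends up with mean 0 and unit variance. A minimal sketch on made-up values (the `values` array is illustrative only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One made-up feature column
values = np.array([[1.0], [2.0], [3.0], [4.0]])

scaled = StandardScaler().fit_transform(values)

# Same result by hand: subtract the mean, divide by the population std (ddof=0)
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
```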

Split the data into train/test.

In [1]:
from sklearn.model_selection import train_test_split

In [1]:
X_train,X_test,y_train,y_test=train_test_split(X_scaled,y,test_size=0.3,random_state=42)

In [1]:
X_train.shape,y_train.shape

In [1]:
X_test.shape,y_test.shape
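Because positive cases are usually much rarer than negative ones in data like this, it can help to keep the class ratio the same in both splits. The sketch below is one way to do that with the `stratify` argument of `train_test_split`; the data is synthetic and the names `X_demo` and `y_demo` are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 200 samples, 10% positives
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([1] * 20 + [0] * 180)

# stratify=y_demo keeps the share of positives equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```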

**Using Logistic Regression**

In [1]:
from sklearn.linear_model import LogisticRegression

In [1]:
l_r=LogisticRegression()

In [1]:
model=l_r.fit(X_train,y_train)

In [1]:
pred=model.predict(X_test)

In [1]:
from sklearn.metrics import classification_report,confusion_matrix

In [1]:
print('Confusion Matrix:\n ', confusion_matrix(y_test,pred))
print('Classification Report:\n ',classification_report(y_test,pred))

So we can see that our model reaches about 90% accuracy on the test set.
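Accuracy should be read together with the confusion matrix when one class is rare. As a worked example on made-up labels (not results from this notebook), a model that always predicts the majority class still scores 90% accuracy while missing every positive case:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 9 healthy patients, 1 patient with cancer
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

cm = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
recall_positive = recall_score(y_true, y_pred)

print(cm)               # rows are true classes, columns are predicted classes
print(accuracy)         # 9 correct out of 10
print(recall_positive)  # share of true positives that were found
```

This is why the per-class precision and recall in the classification report are worth checking alongside the headline accuracy.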

**Logistic Regression Using Kfold Cross validation**

Here we are going to evaluate the same model with k-fold cross-validation and obtain the accuracy.

In [1]:
from sklearn.model_selection import KFold,cross_val_score

In [1]:
kfold=KFold(n_splits=10,shuffle=True,random_state=21)

In [1]:
model=LogisticRegression()

In [1]:
scores=cross_val_score(model,X,y,scoring='accuracy',cv=kfold,n_jobs=-1)


In [1]:
print('Accuracy: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))

We can see that our model reaches about 87% accuracy under cross-validation.
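One way to combine standardization with cross-validation is to wrap both steps in a pipeline, so the scaler is fitted again on the training part of every fold. The sketch below assumes synthetic data from `make_classification`; `X_demo`, `y_demo`, `pipeline` and `kfold_demo` are illustrative names, and `max_iter=1000` is an assumption to help convergence.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Scaling and the classifier are fitted together inside each fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
kfold_demo = KFold(n_splits=10, shuffle=True, random_state=21)

cv_scores = cross_val_score(pipeline, X_demo, y_demo, scoring='accuracy', cv=kfold_demo)

print('Accuracy: %.3f (%.3f)' % (cv_scores.mean(), cv_scores.std()))
```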

**KNN with CV**


Here we are going to use K-nearest neighbors and find the best value of K using GridSearchCV together with k-fold cross-validation.

In [1]:
from sklearn.neighbors import KNeighborsClassifier

In [1]:
knn=KNeighborsClassifier(n_jobs=-1)

In [1]:
knn_neighbors={'n_neighbors':[1,2,3,4,5,6,7,8,9,10]}

In [1]:
from sklearn.model_selection import GridSearchCV

In [1]:
classifier=GridSearchCV(knn,param_grid=knn_neighbors,cv=kfold,verbose=0).fit(X_train,y_train)

In [1]:
classifier.best_params_

In [1]:
best_grid=classifier.best_estimator_

In [1]:
best_grid

In [1]:
predict=best_grid.predict(X_test)

In [1]:
print('Confusion Matrix:\n ',confusion_matrix(y_test,predict))
print('Classification Report:\n ',classification_report(y_test,predict))

We can see that our model reaches about 90% accuracy.
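Besides `best_params_`, the fitted search object keeps the cross-validated score of every candidate in `cv_results_`, which shows how close the alternatives were. A minimal sketch on synthetic data (`X_demo`, `y_demo` and `search` are illustrative names, and 5 folds are used here only to keep it quick):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=1)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': list(range(1, 11))},
    cv=KFold(n_splits=5, shuffle=True, random_state=21),
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

# One row per candidate K with its mean and std accuracy across folds
results = pd.DataFrame(search.cv_results_)[
    ['param_n_neighbors', 'mean_test_score', 'std_test_score']
]
print(results.sort_values('mean_test_score', ascending=False).head())
print(search.best_params_)
```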

**Using Decision Tree with Kfold CV**

In [1]:
from sklearn.tree import DecisionTreeClassifier

In [1]:
df_tree=DecisionTreeClassifier(random_state=0)

In [1]:
df_model=df_tree.fit(X_train,y_train)

In [1]:
df_pred=df_model.predict(X_test)

In [1]:
print('Confusion Matrix:\n ',confusion_matrix(y_test,df_pred))
print('Classification Report:\n ',classification_report(y_test,df_pred))

We can see that our model reaches about 80% accuracy.

Using k-fold cross-validation with the decision tree model.

In [1]:
score=cross_val_score(df_model,X,y,scoring='accuracy',cv=kfold,n_jobs=-1)

In [1]:
print('Accuracy: %.3f (%.3f)' % (np.mean(score), np.std(score)))

**Using Random Forest**

In [1]:
from sklearn.ensemble import RandomForestClassifier

In [1]:
rf=RandomForestClassifier()

In [1]:
rf_model = rf.fit(X_train,y_train)

In [1]:
rf_pred=rf_model.predict(X_test)


In [1]:
print('Confusion Matrix:\n ',confusion_matrix(y_test,rf_pred))
print('Classification Report:\n ',classification_report(y_test,rf_pred))

We have got 90% accuracy.

Using k-fold cross-validation with the random forest model.

In [1]:
s=cross_val_score(rf_model,X,y,scoring='accuracy',cv=kfold,n_jobs=-1)

In [1]:
print('Accuracy: %.3f (%.3f)' % (np.mean(s), np.std(s)))

We get about 87% accuracy.

**Using XGBoost**

In [1]:
import xgboost as xgb

In [1]:
xg_boost=xgb.XGBClassifier()
xg_model=xg_boost.fit(X_train,y_train)

In [1]:
xg_pred=xg_model.predict(X_test)

In [1]:
print('Confusion Matrix:\n ',confusion_matrix(y_test,xg_pred))
print('Classification Report:\n ',classification_report(y_test,xg_pred))

We can see that our model has achieved about 86% accuracy.

Across all the classification techniques we have tried, the best accuracy is about 90%. Looking closely at the classification reports, we can also see that precision, recall and F1 score have increased.
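To compare models on more than accuracy, the same metrics can be collected in one table. The sketch below assumes synthetic, imbalanced data from `make_classification`; the model settings, the `models` dictionary and the `summary` table are illustrative choices, not the results of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with roughly 15% positives
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

models = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'decision_tree': DecisionTreeClassifier(random_state=0),
    'random_forest': RandomForestClassifier(random_state=0),
}

summary = {}
for name, estimator in models.items():
    estimator.fit(X_tr, y_tr)
    # Metrics for the positive class only (average='binary')
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_te, estimator.predict(X_te), average='binary', zero_division=0
    )
    summary[name] = {
        'accuracy': estimator.score(X_te, y_te),
        'precision': precision,
        'recall': recall,
        'f1': f1,
    }

for name, metrics in summary.items():
    print(name, {k: round(float(v), 3) for k, v in metrics.items()})
```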