<h1><center>Cancer Disease Prediction System Using Machine Learning</center></h1>

## Problem statement: predict which people are prone to cancer from parameters such as their `Age`, `Gender`, `Alcohol Use`, `Genetic Risk` and `Smoking`.

## Importing useful Libraries

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
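# "seaborn-whitegrid" was renamed to "seaborn-v0_8-whitegrid" in Matplotlib 3.6
style = "seaborn-v0_8-whitegrid" if "seaborn-v0_8-whitegrid" in plt.style.available else "seaborn-whitegrid"
plt.style.use(style)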

* **The data is also imbalanced, so let's see how training and prediction behave on this dataset.**

In [None]:
# Importing our Dataset

cancer_patient = pd.read_csv("../input/cancer-patientscsv/cancer_patient.csv")
cancer_patient.head()

In [None]:
len(cancer_patient)

In [None]:
cancer_patient.info();

### The data is already clean, with no null values, so let's analyse it further

In [None]:
cancer_patient.describe().T

In [None]:
fig = plt.figure(figsize = (13,8))
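# numeric_only=True skips text columns, which would otherwise raise an error in pandas 2.x
sns.heatmap(cancer_patient.corr(numeric_only=True),cmap='coolwarm',annot=True);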

In [None]:
fig, ax = plt.subplots()
hist = ax.hist(x = cancer_patient["Age"]);

In [None]:
# Create the figure and axes before plotting
fig, ax = plt.subplots()

# Count plot of risk level, split by gender
plot = sns.countplot(data = cancer_patient, x='Level', hue='Gender', palette=['darkblue','darkred'], ax = ax)
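
The count plot can be backed by exact numbers. Below is a minimal sketch using `pd.crosstab` on a small made-up frame. The rows are invented for illustration and are not taken from the dataset; only the column names `Level` and `Gender` and the gender coding come from this notebook.

```python
import pandas as pd

# Toy stand-in for cancer_patient; values are invented for illustration.
toy = pd.DataFrame({
    "Gender": [1, 1, 2, 2, 1, 2, 1, 1],
    "Level": ["Low", "High", "High", "Medium", "High", "Low", "Medium", "High"],
})

# Rows are risk levels, columns are gender codes (1 = male, 2 = female).
level_by_gender = pd.crosstab(toy["Level"], toy["Gender"])
print(level_by_gender)
```

Each cell of the resulting table is the height of one bar in the corresponding count plot.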

In [None]:
cancer_patient.columns

**Cancer in patients aged over 50**

In [None]:
cancer_over50 = cancer_patient[cancer_patient["Age"] > 50]
cancer_over50.head()

In [None]:
# Making Subplots
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows = 2, ncols= 2, figsize=(10, 10))

# Adding Data to the plot
# No cmap here: a colormap has no effect unless a `c` array is passed
scatter = ax1.scatter(x = cancer_over50["Age"], y = cancer_over50["Alcohol use"])

# For Plot ax1
ax1.set(title = "Age with respect to Alcohol Use", 
        xlabel = "Age", 
        ylabel = "Alcohol Use")
ax1.axhline(cancer_over50["Alcohol use"].mean(),
           linestyle = "--");
ax1.set_xlim([50, 80])
ax1.set_ylim([0, 8.5])

# For Plot ax2
scatter = ax2.scatter(x = cancer_over50["Age"], y = cancer_over50["Genetic Risk"])
ax2.set(title = "Age with respect to Genetic Risk", xlabel = "Age", ylabel = "Genetic Risk")
ax2.axhline(cancer_over50["Genetic Risk"].mean(),
           linestyle = "--");
ax2.set_xlim([50, 80])
ax2.set_ylim([0, 7.5])

# For Plot ax3
scatter = ax3.scatter(x = cancer_over50["Age"], y = cancer_over50["Smoking"])
ax3.set(title = "Age with respect to Smoking", xlabel = "Age", ylabel = "Smoking")
ax3.axhline(cancer_over50["Smoking"].mean(),
           linestyle = "--");
ax3.set_xlim([50, 80])
ax3.set_ylim([0, 8.5])

# For Plot ax4
scatter = ax4.scatter(x = cancer_over50["Gender"], y = cancer_over50["Alcohol use"])
ax4.set(title = "Gender with respect to Alcohol Use", xlabel = "Gender", ylabel = "Alcohol Use")
ax4.axhline(cancer_over50["Alcohol use"].mean(),
           linestyle = "--");

In [None]:
cancer_over50.head()

In [None]:
len(cancer_patient), len(cancer_over50)

**There are only 134 patients over 50, so we analyse the entire dataset irrespective of age to get more useful results later.**

In [None]:
cancer_patient.columns

In [None]:
fig, ax = plt.subplots()

histt = ax.hist(x = cancer_patient["Level"], bins = 10, color ='darkred')

ax.set(xlabel = "Level", ylabel = "Count");
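
The histogram gives a visual impression of class balance. The same information can be read as proportions with `value_counts(normalize=True)`. This is a sketch on invented labels, not the real distribution of `cancer_patient["Level"]`:

```python
import pandas as pd

# Invented labels for illustration; the real column is cancer_patient["Level"].
levels = pd.Series(["High"] * 5 + ["Medium"] * 3 + ["Low"] * 2, name="Level")

# Share of each class; a strongly skewed distribution signals imbalance.
class_share = levels.value_counts(normalize=True)
print(class_share)
```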

In [None]:
cancer_patient.info()

### As we can see, the `Level` dtype is not int, so we first replace its labels with numbers and then cast the column to int

In [None]:
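# Assign the result back instead of using inplace=True on a selected column,
# which does not reliably modify the DataFrame in recent pandas versions
cancer_patient["Level"] = cancer_patient["Level"].replace(["Low", "Medium", "High"], ["0", "1", "2"])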

In [None]:
cancer_patient["Level"] = cancer_patient["Level"].astype(int)
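
An alternative sketch of the same encoding uses an explicit dictionary with `Series.map`, which makes it easy to catch labels that were not anticipated, because they become `NaN`. The toy series and the name `level_codes` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Ordinal codes matching the notebook: Low -> 0, Medium -> 1, High -> 2.
level_codes = {"Low": 0, "Medium": 1, "High": 2}

# Invented sample of labels standing in for cancer_patient["Level"].
levels = pd.Series(["Low", "High", "Medium", "High"])

encoded = levels.map(level_codes)

# Any label missing from the dictionary would show up as NaN here.
assert encoded.notna().all(), "unmapped Level value found"
encoded = encoded.astype(int)
print(encoded.tolist())
```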

In [None]:
cancer_patient.info()

#### `Level` is now stored as `int`, like the numeric feature columns

**Plotting with respect to `Age` and `Genetic Risk`**

In [None]:
fig, ax = plt.subplots(figsize = (10, 6));

scatter = ax.scatter(x = cancer_patient["Age"], 
                     y = cancer_patient["Genetic Risk"],
                     c = cancer_patient["Level"],
                     cmap = "winter")

ax.set(xlabel = "Age", 
       ylabel = "Genetic Risk");

ax.legend(*scatter.legend_elements(), title = "Level");

# Dashed line marks the mean of the plotted y variable
ax.axhline(cancer_patient["Genetic Risk"].mean(),
           linestyle = "--");

In [None]:
cancer_patient.plot.kde(figsize = (20,8));

In [None]:
np.array([cancer_patient["Gender"][:10]])

**Number of male and female patients**

In [None]:
# Gender is coded as 1 = male, 2 = female
gender_counts = cancer_patient["Gender"].value_counts()
male = gender_counts.get(1, 0)
female = gender_counts.get(2, 0)
f"Number of males: {male}, Number of females: {female}"

In [None]:
# Select the male patients (Gender == 1)
cancer_patient_male = cancer_patient[cancer_patient["Gender"] == 1]
cancer_patient_male.head()

In [None]:
cancer_patient_female = cancer_patient[cancer_patient["Gender"] == 2]
cancer_patient_female.head()

In [None]:
plt.hist(cancer_patient_male);

In [None]:
plt.hist(cancer_patient_female);

In [None]:
cancer_patient_male.plot.hist(figsize = (15, 50), subplots = True);

In [None]:
cancer_patient_female.plot.hist(figsize = (15, 50), subplots = True);

In [None]:
fig, ax = plt.subplots()
scatter = ax.scatter(x = cancer_patient_male["Alcohol use"], y = cancer_patient_male["Smoking"])
# cancer_patient_male.plot(x = cancer_patient_male["Alcohol use"], y = cancer_patient_male["Age"], kind = "scatter");

In [None]:
fig, ax = plt.subplots()
scatter = ax.scatter(x = cancer_patient_female["Alcohol use"], y = cancer_patient_female["Smoking"]);

In [None]:
fig, ax = plt.subplots()
cancer_patient_male.plot(kind = "bar", x = "Genetic Risk", y = "Smoking", ax = ax);

In [None]:
len(cancer_patient_male), len(cancer_patient_female)

In [None]:
cancer_patient.head()

In [None]:
fig, ax = plt.subplots()
cancer_patient.plot(kind = "bar", x = "Gender", y = "Age", ax = ax);

In [None]:
fig, ax = plt.subplots(figsize = (10, 6))
scatter = ax.scatter(x = cancer_patient["Age"], 
                     y = cancer_patient["Alcohol use"], 
                     c = cancer_patient["Level"], 
                     cmap = "winter")

ax.set(xlabel = "Age", 
       ylabel = "Alcohol use");

ax.legend(*scatter.legend_elements(), title = "Level");

# Dashed line marks the mean of the plotted y variable
ax.axhline(cancer_patient["Alcohol use"].mean(),
           linestyle = "--");

In [None]:
# Create the figure and axes before plotting
fig, ax = plt.subplots()

# Scatter plot of alcohol use against fatigue, coloured by risk level
plot = sns.scatterplot(data=cancer_patient, 
                     x='Alcohol use',
                     y='Fatigue', 
                     hue='Level', 
                     palette=['darkblue','darkred','darkgreen'], 
                     s=50, 
                     marker='o',
                     ax=ax)

In [None]:
# Create the figure and axes before plotting
fig, ax = plt.subplots()

# Scatter plot of genetic risk against smoking, coloured by risk level
plot = sns.scatterplot(data=cancer_patient, 
                     x='Genetic Risk',
                     y='Smoking', 
                     hue='Level', 
                     palette=['darkblue','darkred','darkgreen'], 
                     s=50, 
                     marker='o',
                     ax=ax)

## The data has been explored and is ready for model training

In [None]:
cancer_patient.head()

In [None]:
cancer_patient.drop(["Patient Id"], axis = 1, inplace= True)

In [None]:
cancer_patient.head()

### Fitting models and using them to make predictions on our data

#### First we use a Support Vector Machine estimator

In [None]:
from sklearn import svm
from sklearn.model_selection import train_test_split

X = cancer_patient.drop(["Level"], axis = 1)
y = cancer_patient["Level"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

sv = svm.SVC()
sv.fit(X_train, y_train)
sv.score(X_test, y_test)
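
`SVC` is sensitive to feature scale, and an unstratified split of imbalanced classes can leave a class under-represented in the test set. Here is a hedged sketch of one way to address both, using synthetic data from `make_classification` as a stand-in for the patient features. The data, the `random_state` values and the variable names are assumptions, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic three-class data standing in for X and y from the notebook.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, random_state=0,
)

# stratify keeps class proportions equal in train and test;
# random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# The scaler is fitted on the training rows only, then applied to the test rows.
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_tr, y_tr)
print(scaled_svc.score(X_te, y_te))
```

Wrapping the scaler and the estimator in one pipeline also means `cross_val_score` would refit the scaler inside each fold.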

In [None]:
y_preds = sv.predict(X_test)
y_preds[:10]

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(classification_report(y_test, y_preds))

In [None]:
confusion_matrix(y_test, y_preds)
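
In scikit-learn's `confusion_matrix`, rows correspond to true classes and columns to predicted classes, so the diagonal holds the correct predictions. A small sketch with invented labels shows how per-class recall follows from the matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented true and predicted labels (0 = Low, 1 = Medium, 2 = High).
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 2, 0]

cm = confusion_matrix(y_true, y_pred)

# Recall per class = correct predictions / number of true samples in that class.
recall_per_class = np.diag(cm) / cm.sum(axis=1)
print(cm)
print(recall_per_class)
```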

In [None]:
accuracy_score(y_test, y_preds)

### Checking accuracy with another model

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X = cancer_patient.drop(["Level"], axis = 1)
y = cancer_patient["Level"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
knn.score(X_test, y_test)

**The score from `KNeighborsClassifier` is higher, and the difference is clearly visible**

#### Lastly we use RandomForestClassifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X = cancer_patient.drop(["Level"], axis = 1)
y = cancer_patient["Level"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Level is a class label, so a classifier is used rather than a regressor
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
rfc.score(X_test, y_test)

**Cross-validating all of the above models to check that the scores are reliable**

In [None]:
from sklearn.model_selection import cross_val_score

crossVal_sv = cross_val_score(sv, X, y)
crossVal_knn = cross_val_score(knn, X, y)
crossVal_rfc = cross_val_score(rfc, X, y)

print(f"For SupportVectorMachine: {crossVal_sv}, \nFor KNeighborsClassifier: {crossVal_knn}, \nFor RandomForestClassifier: {crossVal_rfc}")
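
`cross_val_score` returns one score per fold, so summarising with the mean and standard deviation makes models easier to compare. This sketch runs on synthetic data and assumes a shuffled `StratifiedKFold`, whereas the notebook itself uses the default `cv`. The data and seeds are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for X and y.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, random_state=0,
)

# Shuffled, stratified folds with a fixed seed for reproducibility.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(KNeighborsClassifier(), X_demo, y_demo, cv=cv)

print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```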

**Comparing the single-split `score` with the mean `cross_val_score`**

In [None]:
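# For SupportVectorMachine

np.random.seed(42)

# Note: X_test and y_test come from the most recent train_test_split call,
# so they may contain rows this model was trained on
sv_single_score = sv.score(X_test, y_test)

sv_cross_val_score = np.mean(cross_val_score(sv, X, y))

sv_single_score, sv_cross_val_score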

In [None]:
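# For KNeighborsClassifier

np.random.seed(42)

# Note: X_test and y_test come from the most recent train_test_split call,
# so they may contain rows this model was trained on
knn_single_score = knn.score(X_test, y_test)

knn_cross_val_score = np.mean(cross_val_score(knn, X, y))

knn_single_score, knn_cross_val_score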
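
Because every model cell calls `train_test_split` again without a fixed seed, the single-split scores are measured on different rows for each model. One way to keep the comparison fair is a single shared split. The sketch below uses synthetic data, and the data, seeds and names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic data standing in for X and y.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, random_state=0,
)

# Split once and reuse the same train/test rows for every model.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

models = {"svc": SVC(), "knn": KNeighborsClassifier()}
held_out_scores = {
    name: model.fit(X_tr, y_tr).score(X_te, y_te)
    for name, model in models.items()
}
print(held_out_scores)
```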

##### `KNeighborsClassifier` gives almost 100% accuracy