<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="cognitiveclass.ai logo"  />
</center>

# **Forecasting of Breast Cancer on medical measurement**

# Lab 3. Data Analysis with Python

Estimated time needed: **30** minutes

## Objectives

After completing this lab you will be able to:

*   Explore features or characteristics to predict Patient's Vital Status.


<h2>Table of Contents</h2>

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li><a href="#import_data">Import Data from Module 2</a></li>
    <li><a href="#pattern_visualization">Analyzing Individual Feature Patterns Using Visualization</a></li>
    <li><a href="#discriptive_statistics">Descriptive Statistical Analysis</a></li>
    <li><a href="#basic_grouping">Basics of Grouping</a></li>
    <li><a href="#correlation_causation">Correlation and Causation</a></li>
    <li><a href="#anova">ANOVA</a></li>
</ol>

</div>

<hr>


<h3>What are the main characteristics that have the most impact on a Patient's Vital Status?</h3>


<h2 id="import_data">1. Import Data from Module 2</h2>


<h4>Setup</h4>


The optional install commands below use pip, conda, or mamba. If a library is missing, uncomment the command that matches your environment.

Import libraries:


In [None]:
# conda install -c anaconda scikit-learn

In [None]:
# !pip install dython
# ! mamba install seaborn=0.9.0 -y

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
If an error appears, please restart the kernel or run this block again.
</div>


In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn import preprocessing
from dython.nominal import associations

from scipy import stats

import itertools

Load the data and store it in dataframe `df`:


In [None]:
path='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0VB7EN/breast_cancer_clean.csv'

In [None]:
df = pd.read_csv(path, index_col=0)
df.head()

Let's set `pd.options.display.float_format = '{:,.2f}'.format` so that Pandas displays float numbers with two decimal places and a comma as the thousands separator.

In [None]:
pd.options.display.float_format = '{:,.2f}'.format
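To see what this format string does on its own, here is a minimal sketch in plain Python. The sample numbers are made up for illustration and do not come from the dataset.

```python
# The same formatter that is assigned to pd.options.display.float_format
fmt = '{:,.2f}'.format

# Illustrative values only (not taken from the lab data)
large_value = fmt(1234567.891)
small_value = fmt(0.5)

print(large_value)  # 1,234,567.89
print(small_value)  # 0.50
```

The comma groups digits in thousands, and `.2f` rounds to two decimal places. Only the display changes; the values stored in the dataframe keep their full precision.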


<h2 id="pattern_visualization">2. Analyzing Individual Feature Patterns Using Visualization</h2>


<h4>How to choose the right visualization method?</h4>
<p>When visualizing individual variables, it is important to first understand what type of variable you are dealing with. This will help us find the right visualization method for that variable.</p>


In [None]:
# list the data types for each column
print(df.dtypes)

We need to transform the categorical column "Patient's Vital Status" so that we can work with it as a numeric column. The encoder's "transform" method converts the original categorical column into an array of numeric codes (floats by default), which is assigned to the variable transformed_vital_stat. Finally, a new column called "Patient's Vital Status - transformed" is created in the dataframe and its values are set to the encoded data.

In [None]:
enc = preprocessing.OrdinalEncoder()
enc.fit(df[["Patient's Vital Status"]])

#Write down to array
transformed_vital_stat = enc.transform(df[["Patient's Vital Status"]])

#Create new column
df["Patient's Vital Status - transformed"] = transformed_vital_stat

df

Now we have a new column "Patient's Vital Status - transformed", which contains the following values: 0 - Died of Disease, 1 - Died of Other Causes, 2 - Living.
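The mapping from labels to codes can be read from the encoder itself. The following is a minimal sketch on a small toy dataframe (the column name `status` and the rows are assumptions made for illustration, not the lab data); it relies on the fact that `OrdinalEncoder` sorts the categories it finds and stores them in its `categories_` attribute.

```python
import pandas as pd
from sklearn import preprocessing

# Toy data (not the lab dataset): three status labels in arbitrary order
toy = pd.DataFrame(
    {"status": ["Living", "Died of Disease", "Living", "Died of Other Causes"]}
)

enc = preprocessing.OrdinalEncoder()
codes = enc.fit_transform(toy[["status"]])

# categories_[0] lists the labels in code order: position 0 -> code 0, and so on
mapping = {label: code for code, label in enumerate(enc.categories_[0])}

print(mapping)
print(codes.ravel())
```

Because the codes follow alphabetical order rather than any clinical ordering, it is worth checking the mapping this way before interpreting averages of the encoded column.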

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h3>Question  #1:</h3>

<b>What is the data type of the column "Tumor Size"? </b>

</div>


In [None]:
# Write your code below and press Shift+Enter to execute


<details><summary>Click here for the solution</summary>

```python
df['Tumor Size'].dtypes
```

</details>


For example, we can calculate the correlation between variables  of type "int64" or "float64" using the method "corr":


In [None]:
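# Restrict to numeric columns: recent pandas versions raise an error on text columns
corr = df.select_dtypes(include='number').corr()
corr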

The diagonal elements are always one. We will study correlation, and more precisely Pearson correlation, in depth at the end of the notebook.


Let's take a look at the correlation heatmap of our data:

In [None]:
sns.heatmap(corr, linewidths=.5)

The associations method from the dython.nominal library is used to calculate and visualize the association between the categorical variables in a Pandas dataframe.

In [None]:
associations(df[["Type of Breast Surgery", "Cancer Type", "Cancer Type Detailed", "Cellularity", "Chemotherapy", "Pam50 + Claudin-low subtype", "ER status measured by IHC", "ER Status", "HER2 status measured by SNP6", "HER2 Status", "Tumor Other Histologic Subtype", "Hormone Therapy", "Integrative Cluster", "Primary Tumor Laterality", "Oncotree Code", "Overall Survival Status", "PR Status", "Radio Therapy", "3-Gene classifier subtype", "Patient's Vital Status", "Nottingham prognostic index-binned", "Patient's Vital Status - transformed"]], annot=False)

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h3> Question  #2: </h3>

<p>Find the correlation between the following columns: Tumor Size, Cohort, Neoplasm Histologic Grade, and Mutation Count.</p>
<p>Hint: if you would like to select those columns, use the following syntax: df[['Tumor Size', 'Cohort', 'Neoplasm Histologic Grade', 'Mutation Count']]</p>
</div>


In [None]:
# Write your code below and press Shift+Enter to execute


<details><summary>Click here for the solution</summary>

```python
df[['Tumor Size', 'Cohort', 'Neoplasm Histologic Grade', 'Mutation Count']].corr()
```

</details>


<h2>Continuous Numerical Variables:</h2> 

<p>Continuous numerical variables are variables that may contain any value within some range. They can be of type "int64" or "float64". A great way to visualize these variables is by using scatterplots with fitted lines.</p>

<p>In order to start understanding the (linear) relationship between an individual variable and a target variable, we can use "regplot", which plots the scatterplot plus the fitted regression line for the data.</p>

<p>For forecasting, we selected column "Patient's Vital Status - transformed", but this column has only 3 values. Therefore, we will take other columns in order to show what examples of graphs exist.</p>
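The line that "regplot" overlays on the scatterplot is, by default, an ordinary least-squares fit of a straight line. A minimal sketch of the same kind of fit with NumPy is shown here; the `x` and `y` arrays are made-up toy values, not columns from the dataset.

```python
import numpy as np

# Toy measurements (illustrative only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# Fit a degree-1 polynomial (a straight line) by least squares
slope, intercept = np.polyfit(x, y, deg=1)

# Points on the fitted line, i.e. what a regression line would display
y_fitted = slope * x + intercept

print("slope:", round(slope, 2), "intercept:", round(intercept, 2))
```

A positive slope corresponds to a positive linear relationship and a negative slope to a negative one; how closely the points follow the line is what the correlation coefficient measures.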


Let's see several examples of different linear relationships:


<h3>Positive Linear Relationship</h3>


Let's find the scatterplot of "Lymph nodes examined positive" and "Nottingham prognostic index".


In [None]:
# Lymph nodes examined positive as potential predictor variable of Nottingham prognostic index
sns.regplot(x="Lymph nodes examined positive", y="Nottingham prognostic index", data=df)
plt.ylim(0,)

<p>As Lymph nodes examined positive goes up, the Nottingham prognostic index goes up: this indicates a positive correlation between these two variables. Lymph nodes examined positive seems like a moderately good predictor of Nottingham prognostic index, since the regression line has a clear upward slope.</p>


We can examine the correlation between 'Lymph nodes examined positive' and 'Nottingham prognostic index' and see that it's approximately 0.52.


In [None]:
df[["Lymph nodes examined positive", "Nottingham prognostic index"]].corr()

Lymph nodes examined positive is a potential predictor variable of Relapse Free Status (Years). Let's find the scatterplot of "Lymph nodes examined positive" and "Relapse Free Status (Years)".


In [None]:
sns.regplot(x="Lymph nodes examined positive", y="Relapse Free Status (Years)", data=df)

<p>As Lymph nodes examined positive goes up, the Relapse Free Status (Years) goes down: this indicates an inverse/negative relationship between these two variables. Lymph nodes examined positive could potentially be a predictor of Relapse Free Status (Years).</p>


We can examine the correlation between 'Lymph nodes examined positive' and 'Relapse Free Status (Years)' and see it's approximately -0.22.


In [None]:
df[["Lymph nodes examined positive", "Relapse Free Status (Years)"]].corr()

<h3>Weak Linear Relationship</h3>


Let's see if "Tumor Size" is a predictor variable of "Mutation Count".


In [None]:
sns.regplot(x="Tumor Size", y="Mutation Count", data=df)

<p>Tumor Size does not seem like a good predictor of the Mutation Count at all since the regression line is close to horizontal. Also, the data points are very scattered and far from the fitted line, showing lots of variability. Therefore, it's not a reliable variable.</p>


We can examine the correlation between 'Tumor Size' and 'Mutation Count' and see it's approximately 0.02.


In [None]:
df[["Tumor Size", "Mutation Count"]].corr()

 <div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  3 a): </h1>

<p>Find the correlation  between x="Relapse Free Status (Months)" and y="Age at Diagnosis".</p>
<p>Hint: if you would like to select those columns, use the following syntax: df[["Relapse Free Status (Months)", "Age at Diagnosis"]].  </p>
</div>


In [None]:
# Write your code below and press Shift+Enter to execute


<details><summary>Click here for the solution</summary>

```python

#The correlation is -0.1, the non-diagonal elements of the table.

df[["Relapse Free Status (Months)","Age at Diagnosis"]].corr()

```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1>Question  3 b):</h1>

<p>Given the correlation results between "Age at Diagnosis" and "Relapse Free Status (Months)", do you expect a linear relationship?</p>
<p>Verify your results using the function "regplot()".</p>
</div>


In [None]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python

#There is a weak correlation between the variables 'Relapse Free Status (Months)' and 'Age at Diagnosis'; as such, regression will not work well. We can use "regplot" to demonstrate this.

#Code: 
sns.regplot(x="Relapse Free Status (Months)", y="Age at Diagnosis", data=df)

```

</details>


<h3>Categorical Variables</h3>

<p>These are variables that describe a 'characteristic' of a data unit, and are selected from a small group of categories. The categorical variables can have the type "object" or "int64". A good way to visualize categorical variables is by using boxplots.</p>


Let's look at the relationship between "Integrative Cluster" and "Overall Survival (Years)".


In [None]:
sns.boxplot(x="Integrative Cluster", y="Overall Survival (Years)", data=df)

<p>We see that the distributions of Overall Survival (Years) between the different Integrative Cluster categories have a significant overlap, so Integrative Cluster would not be a good predictor of Overall Survival (Years). If you have many points that are distributed along the Y axis in your boxplot, it suggests that the data in that column has a large spread or variability. Cluster "8", for example, has a distribution that differs from the others.</p>

<p>Let's examine "Tumor Stage" and "Patient's Vital Status - transformed":</p>


In [None]:
sns.boxplot(x="Tumor Stage", y="Patient's Vital Status - transformed", data=df)

<p>Here the categories appear reasonably well separated, although stages 0 and 4 raise questions. Stage 0 is the earliest form of the tumor, so its mortality should be the lowest; stage 4 is the most advanced stage of the tumor, so its survival rate should be the lowest.</p>


Let's examine "Nottingham prognostic index-binned" and "Overall Survival (Years)".


In [None]:
sns.boxplot(x="Nottingham prognostic index-binned", y="Overall Survival (Years)", data=df)

<p>Here we see that the distribution of Overall Survival (Years) between the different Nottingham prognostic index-binned categories differs. As such, Nottingham prognostic index-binned could potentially be a predictor of Overall Survival (Years).</p>


<h2 id="discriptive_statistics">3. Descriptive Statistical Analysis</h2>


<p>Let's first take a look at the variables by utilizing a description method.</p>

<p>The <b>describe</b> function automatically computes basic statistics for all continuous variables. Any NaN values are automatically skipped in these statistics.</p>

This will show:

<ul>
    <li>the count of that variable</li>
    <li>the mean</li>
    <li>the standard deviation (std)</li> 
    <li>the minimum value</li>
    <li>the quartiles (the 25%, 50% and 75% percentiles; the interquartile range, IQR, is the distance between the 25% and 75% values)</li>
    <li>the maximum value</li>
</ul>
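The items in this list can be checked on a tiny example. This is a minimal sketch on a made-up Series (not the lab data) that contains one missing value, showing that NaN is skipped and how the interquartile range follows from the reported quartiles.

```python
import numpy as np
import pandas as pd

# Toy values with one missing entry (illustrative only)
values = pd.Series([10, 20, 30, 40, 50, np.nan])

summary = values.describe()

# The IQR is not printed by describe(); it is the 75% value minus the 25% value
q1, q3 = summary["25%"], summary["75%"]
iqr = q3 - q1

print(summary)
print("IQR:", iqr)
```

The count is 5 rather than 6 because the NaN entry is left out of every statistic.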


We can apply the method "describe" as follows:


In [None]:
df.describe()

The default setting of "describe" skips variables of type object. We can apply the method "describe" on the variables of type 'object' as follows:


In [None]:
df.describe(include=['object'])

<h3>Value Counts</h3>


<p>Value counts is a good way of understanding how many units of each characteristic/variable we have. We can apply the "value_counts" method on the column "Radio Therapy". Keep in mind that here we call "value_counts" on a pandas Series, so we use one pair of brackets <code>df['Radio Therapy']</code>, not two <code>df[['Radio Therapy']]</code>, which would select a DataFrame.</p>


In [None]:
df['Radio Therapy'].value_counts()

We can convert the series to a dataframe as follows:


In [None]:
df['Radio Therapy'].value_counts().to_frame()

Let's repeat the above steps but save the results to the dataframe "radio_therapy_counts" and name its single column 'value_counts'.


In [None]:
radio_therapy_counts = df['Radio Therapy'].value_counts().to_frame()
# Set the column name directly: the default name differs between pandas versions
radio_therapy_counts.columns = ['value_counts']
radio_therapy_counts

Now let's rename the index to 'Radio Therapy':


In [None]:
radio_therapy_counts.index.name = 'Radio Therapy'
radio_therapy_counts

We can repeat the above process for the variable 'Tumor Stage'.


In [None]:
# Tumor Stage as variable
tumor_stage_counts = df['Tumor Stage'].value_counts().to_frame()
tumor_stage_counts.columns = ['value_counts']
tumor_stage_counts.index.name = 'Tumor Stage'
tumor_stage_counts.head(10)

<p>After examining the value counts of the Tumor Stage, we see that Tumor Stage would not be a good predictor variable for the Patient's Vital Status. This is because we only have 11 patients at Tumor Stage 4 and 24 patients at Tumor Stage 0, so the distribution is heavily skewed. Thus, we are not able to draw any conclusions about the Tumor Stage.</p>
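One way to quantify this kind of imbalance is to look at shares instead of raw counts. The sketch below uses a made-up Series of stage labels (an assumption for illustration, not the lab data) and the `normalize=True` option of "value_counts".

```python
import pandas as pd

# Toy stage labels (made up for illustration)
stages = pd.Series([1, 1, 1, 2, 2, 2, 2, 3, 3, 4], name="stage")

counts = stages.value_counts()                # absolute counts per category
shares = stages.value_counts(normalize=True)  # the same counts as fractions of the total

# Combine both views in one table, ordered by stage
summary = pd.DataFrame({"count": counts, "share": shares}).sort_index()
print(summary)
```

Categories with a very small share contribute few observations, so any statistic computed for them is unstable.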


<h2 id="basic_grouping">4. Basics of Grouping</h2>


<p>The "groupby" method groups data by different categories. The data is grouped based on one or several variables, and analysis is performed on the individual groups.</p>

<p>For example, let's group by the variable "Nottingham prognostic index". We see that there are 7 different categories of Nottingham prognostic index.</p>


In [None]:
df['Nottingham prognostic index'].unique()

<p>If we want to know, on average, which group has the best vital status, we can group the data by a variable such as "Radio Therapy" and then average the encoded status within each group.</p>

<p>We can select the columns 'Radio Therapy', 'Nottingham prognostic index' and 'Patient's Vital Status - transformed', then assign it to the variable "df_group_one".</p>


In [None]:
df_group_one = df[['Radio Therapy','Nottingham prognostic index',"Patient's Vital Status - transformed"]]

We can then calculate the average Patient's Vital Status for each of the different categories of data.


In [None]:
# grouping results
df_group_one = df_group_one.groupby(['Radio Therapy'],as_index=False).mean()
df_group_one

<p>From our data, it seems that patients with "True" Radio Therapy have, on average, the highest encoded vital status, that is, the value closest to "Living".</p>

<p>You can also group by multiple variables. For example, let's group by both 'Radio Therapy' and 'Nottingham prognostic index'. This groups the dataframe by the unique combination of 'Radio Therapy' and 'Nottingham prognostic index'. We can store the results in the variable 'grouped_test1'.</p>


In [None]:
# grouping results
df_gptest = df[['Radio Therapy','Nottingham prognostic index',"Patient's Vital Status - transformed"]]
grouped_test1 = df_gptest.groupby(['Radio Therapy','Nottingham prognostic index'],as_index=False).mean()
grouped_test1

<p>This grouped data is much easier to visualize when it is made into a pivot table. A pivot table is like an Excel spreadsheet, with one variable along the column and another along the row. We can convert the dataframe to a pivot table using the method "pivot" to create a pivot table from the groups.</p>

<p>In this case, we will leave the Radio Therapy variable as the rows of the table, and pivot Nottingham prognostic index to become the columns of the table:</p>


In [None]:
grouped_pivot = grouped_test1.pivot(index='Radio Therapy',columns='Nottingham prognostic index')
grouped_pivot
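The group-then-pivot pattern can be followed on a very small table. This is a minimal sketch with made-up column names (`therapy`, `grade`, `status`) and toy rows, not the lab data; it also shows `pivot_table`, which performs the grouping and the pivot in one step.

```python
import pandas as pd

# Toy data (illustrative only)
toy = pd.DataFrame({
    "therapy": ["YES", "YES", "NO", "NO", "YES"],
    "grade":   [1, 2, 1, 1, 2],
    "status":  [2.0, 0.0, 1.0, 2.0, 2.0],
})

# Step 1: average the status for every (therapy, grade) combination
grouped = toy.groupby(["therapy", "grade"], as_index=False).mean()

# Step 2: reshape so that therapy forms the rows and grade forms the columns
pivot = grouped.pivot(index="therapy", columns="grade", values="status")

# Alternative: pivot_table groups and reshapes in a single call
one_step = toy.pivot_table(index="therapy", columns="grade",
                           values="status", aggfunc="mean")

print(pivot)
```

The combination ("NO", 2) never occurs in the toy rows, so that cell is NaN in both tables.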

<p>Often, we won't have data for some of the pivot cells. We can fill these missing cells with an empty value, but any other value could potentially be used as well. It should be mentioned that missing data is quite a complex subject and is an entire course on its own.</p>


In [None]:
grouped_pivot = grouped_pivot.fillna('') #fill missing values with empty value
grouped_pivot

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1>Question 4:</h1>

<p>Use the "groupby" function to find the average "Patient's Vital Status" for each value of "Nottingham prognostic index".</p>
</div>


In [None]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
# grouping results
df_gptest2 = df[['Nottingham prognostic index',"Patient's Vital Status - transformed"]]
grouped_test_npi = df_gptest2.groupby(['Nottingham prognostic index'],as_index= False).mean()
grouped_test_npi

```

</details>


<h4>Variables: Radio Therapy and Nottingham prognostic index vs. Patient's Vital Status</h4>


Let's use a heat map to visualize the relationship between Radio Therapy and Nottingham prognostic index vs Patient's Vital Status.


In [None]:
#use the grouped results
grouped_pivot = grouped_test1.pivot(index='Radio Therapy',columns='Nottingham prognostic index').fillna(0)
plt.pcolor(grouped_pivot, cmap='RdBu')
plt.colorbar()
plt.show()

<p>The heatmap plots the target variable (Patient's Vital Status) proportional to colour with respect to the variables 'Radio Therapy' and 'Nottingham prognostic index' on the vertical and horizontal axis, respectively. This allows us to visualize how the Patient's Vital Status is related to 'Radio Therapy' and 'Nottingham prognostic index'.</p>

<p>The default labels convey no useful information to us. Let's change that:</p>


In [None]:
fig, ax = plt.subplots()
im = ax.pcolor(grouped_pivot, cmap='RdBu')

#label names
row_labels = grouped_pivot.columns.levels[1]
col_labels = grouped_pivot.index

#move ticks and labels to the center
ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)

#insert labels
ax.set_xticklabels(row_labels, minor=False)
ax.set_yticklabels(col_labels, minor=False)

#rotate label if too long
plt.xticks(rotation=90)

fig.colorbar(im)
plt.show()

<p>Visualization is very important in data science, and Python visualization packages provide great freedom. We will go more in-depth in a separate Python visualizations course.</p>

<p>The main question we want to answer in this module is, "What are the main characteristics which have the most impact on the Patient's Vital Status?".</p>

<p>To get a better measure of the important characteristics, we look at the correlation of these variables with the Patient's Vital Status. In other words: how is the Patient's Vital Status dependent on this variable?</p>


<h2 id="correlation_causation">5. Correlation and Causation</h2>


<p><b>Correlation</b>: a measure of the extent of interdependence between variables.</p>

<p><b>Causation</b>: the relationship between cause and effect between two variables.</p>

<p>It is important to know the difference between these two. Correlation does not imply causation. Determining correlation is much simpler than determining causation, as causation may require independent experimentation.</p>


<p><b>Pearson Correlation</b></p>
<p>The Pearson Correlation measures the linear dependence between two variables X and Y.</p>
<p>The resulting coefficient is a value between -1 and 1 inclusive, where:</p>
<ul>
    <li><b>1</b>: Perfect positive linear correlation.</li>
    <li><b>0</b>: No linear correlation. The two variables have no linear relationship, although a non-linear one may still exist.</li>
    <li><b>-1</b>: Perfect negative linear correlation.</li>
</ul>
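The three cases can be reproduced with a few lines of NumPy. This is a minimal sketch: `pearson_r` is a hypothetical helper written for illustration (covariance divided by the product of the standard deviations), and the toy values are made up.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

x = [-2, -1, 0, 1, 2]

r_up = pearson_r(x, [2 * v + 1 for v in x])   # points on a rising straight line
r_down = pearson_r(x, [-3 * v for v in x])    # points on a falling straight line
r_curve = pearson_r(x, [v ** 2 for v in x])   # a parabola: related, but not linearly

# Cross-check the helper against NumPy's built-in correlation matrix
r_numpy = float(np.corrcoef(x, [2 * v + 1 for v in x])[0, 1])

print(r_up, r_down, r_curve)
```

The parabola case is the reason a coefficient near 0 should be read as "no linear relationship" rather than "no relationship".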


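<p>Pearson Correlation is the default method of the function "corr". Like before, we can calculate the Pearson Correlation of the 'int64' or 'float64' variables.</p>


In [None]:
# Restrict to numeric columns: recent pandas versions raise an error on text columns
df.select_dtypes(include='number').corr()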

Sometimes we would like to know the significance of the correlation estimate.


<b>P-value</b>

<p>What is this P-value? The P-value is the probability of observing a correlation at least as strong as the one in our sample if there were actually no correlation between the two variables. Normally, we choose a significance level of 0.05: when the P-value falls below it, we treat the correlation as statistically significant.</p>

By convention, when the

<ul>
    <li>p-value is $<$ 0.001: there is strong evidence that the correlation is significant.</li>
    <li>p-value is $<$ 0.05: there is moderate evidence that the correlation is significant.</li>
    <li>p-value is $<$ 0.1: there is weak evidence that the correlation is significant.</li>
    <li>p-value is $>$ 0.1: there is no evidence that the correlation is significant.</li>
</ul>
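These thresholds can be written as a small lookup. The function below is a hypothetical helper introduced only for illustration (it is not part of scipy or pandas); it returns the first label whose threshold the p-value falls under.

```python
def evidence_label(p_value):
    """Map a p-value to the conventional wording used in the list above."""
    if p_value < 0.001:
        return "strong"
    if p_value < 0.05:
        return "moderate"
    if p_value < 0.1:
        return "weak"
    return "none"

# Illustrative p-values only
labels = [evidence_label(p) for p in (0.0002, 0.03, 0.08, 0.4)]
print(labels)
```

Such a helper can be applied to the P-values printed in the next cell to summarize many columns at a glance.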


<h3>Each column vs. Patient's Vital Status</h3>


Let's calculate the  Pearson Correlation Coefficient and P-value of different columns of float and int type and 'Patient's Vital Status'.


In [None]:
columns = df.select_dtypes(include=['int', 'float']).columns
columns = columns[:-1]

for c in columns:
    pearson_coef, p_value = stats.pearsonr(df[c], df["Patient's Vital Status - transformed"])
    print(c, ":\n Pearson Correlation Coefficient is", round(pearson_coef, 2), " with a P-value of P=", "{:.2e}".format(p_value))


<h4>Conclusion:</h4>
<p>For the columns whose p-value is $<$ 0.001, the correlation with Patient's Vital Status is statistically significant, although the linear relationship isn't extremely strong. These columns are 'Overall Survival (Months)', 'Relapse Free Status (Months)', 'Relapse Free Status-Not Recurred', 'Relapse Free Status-Recurred', 'Lymph nodes examined positive'.
</p>


<h2 id="anova">6. ANOVA</h2>


<h3>ANOVA: Analysis of Variance</h3>
<p>The Analysis of Variance  (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:</p>

<p><b>F-test score</b>: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from the assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.</p>

<p><b>P-value</b>:  P-value tells how statistically significant our calculated score value is.</p>

<p>If Patient's Vital Status is strongly related to the variable we are analyzing, we expect ANOVA to return a sizeable F-test score and a small p-value.</p>
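To get a feel for these two numbers, here is a minimal sketch that runs `scipy.stats.f_oneway` on made-up groups (toy values, not the lab data): once with group means that are far apart and once with identical group means.

```python
from scipy import stats

# Three toy groups whose means (3, 13, 23) are far apart relative to their spread
far_a = [1, 2, 3, 4, 5]
far_b = [11, 12, 13, 14, 15]
far_c = [21, 22, 23, 24, 25]
f_far, p_far = stats.f_oneway(far_a, far_b, far_c)

# Two toy groups with the same mean (3) but different spread
same_a = [2, 3, 3, 3, 4]
same_b = [1, 3, 3, 3, 5]
f_same, p_same = stats.f_oneway(same_a, same_b)

print("far apart:  F =", round(f_far, 2), " P =", "{:.2e}".format(p_far))
print("same means: F =", round(f_same, 2), " P =", round(p_same, 2))
```

Well-separated means give a large F-test score and a tiny P-value, while identical means give an F-test score of 0 and a P-value of 1.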


<h3>Nottingham prognostic index</h3>


<p>Since ANOVA analyzes the difference between different groups of the same variable, the groupby function will come in handy. Because the ANOVA algorithm averages the data automatically, we do not need to take the average beforehand.</p>

<p>To see if different types of 'Nottingham prognostic index' impact  'Patient's Vital Status', we group the data.</p>


In [None]:
grouped_test2=df_gptest[['Nottingham prognostic index', "Patient's Vital Status - transformed"]].groupby(['Nottingham prognostic index'])
grouped_test2.head(2)

In [None]:
df_gptest

We can obtain the values of a single group using the method "get_group".


In [None]:
grouped_test2.get_group(1)["Patient's Vital Status - transformed"]

We can use the function 'f_oneway' in the module 'stats' to obtain the <b>F-test score</b> and <b>P-value</b>.


In [None]:
# ANOVA across all seven Nottingham prognostic index groups
target = "Patient's Vital Status - transformed"
samples = [grouped_test2.get_group(g)[target] for g in range(1, 8)]
f_val, p_val = stats.f_oneway(*samples)

# Report p_val from this test (not p_value left over from the Pearson loop)
print("ANOVA results: F=", round(f_val, 2), ", P =", "{:.2e}".format(p_val))

A large F-test score together with a P-value of almost 0 indicates, with almost certain statistical significance, that the mean Patient's Vital Status differs between at least some of the seven groups. But does this mean that every pair of groups differs this strongly?

Let's examine them separately.


To compare pairs of groups, we can use `itertools`:

In [None]:
values = [1, 2, 3, 4, 5, 6, 7]
for a, b in itertools.combinations(values, 2):
    f_val, p_val = stats.f_oneway(grouped_test2.get_group(a)["Patient's Vital Status - transformed"], grouped_test2.get_group(b)["Patient's Vital Status - transformed"])
    print(a, "and", b, "ANOVA results: F=", round(f_val, 2), ", P =", "{:.2e}".format(p_val))

The results show that the F-test scores and P-values vary from pair to pair, so only some pairs of groups differ significantly.

<h3>Conclusion: Important Variables</h3>


<p>We now have a better idea of what our data looks like and which variables are important to take into account when predicting the Patient's Vital Status. We have narrowed it down to the following variables:</p>

Continuous numerical variables:

<ul>
    <li>Overall Survival (Months)</li>
    <li>Relapse Free Status (Months)</li>
    <li>Relapse Free Status-Not Recurred</li>
    <li>Relapse Free Status-Recurred</li>
    <li>Lymph nodes examined positive</li>
    <li>Nottingham prognostic index</li>
    <li>Tumor Stage</li>
</ul>

Categorical variables:

<ul>
    <li>Pam50 + Claudin-low subtype</li>
    <li>Integrative Cluster</li>
    <li>Overall Survival Status</li>
</ul>

<p>As we now move into building machine learning models to automate our analysis, feeding the model with variables that meaningfully affect our target variable will improve our model's prediction performance.</p>


Before saving the DataSet, let's remove the unnecessary columns.


In [None]:
df["Cancer Type"].value_counts()

In this particular case, "Breast Sarcoma" only appears three times out of a total of 2509 cases, which is about 0.12% of the total data. This means that it is unlikely that this value will have a significant impact on any predictions or decisions. Therefore, deleting this column may simplify our analysis and improve its accuracy and interpretability.
We do not need "Overall Survival Status", because our predictive column is "Patient's Vital Status".
The "Sex" column can also be removed because all patients are female.
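The share of such a rare category can be checked with a quick calculation. The counts 3 and 2509 are taken from the text above; the variable names are chosen only for this sketch.

```python
# Counts quoted in the text above
rare_cases = 3
total_cases = 2509

# Share of the rare category, expressed as a percentage of all cases
share_percent = rare_cases / total_cases * 100

print(round(share_percent, 2), "%")
```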

In [None]:
df = df.drop(["Cancer Type", "Overall Survival Status", "Sex", "Patient\'s Vital Status - transformed"], axis=1)

Save the new csv:

In [None]:
df.to_csv('clean_df.csv')

### Thank you for completing this lab!

## Author

<a href="https://author.skills.network/instructors/dmytro_shliakhovskyi">Dmytro Shliakhovskyi</a>

### Other Contributors

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2">Prof. Yaroslav Vyklyuk, DrSc, PhD</a>

<a href="https://author.skills.network/instructors/nataliya_boyko">Ass. Prof. Nataliya Boyko, PhD</a>


## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description                                         |
| ----------------- | ------- | ---------- | ---------------------------------------------------------- |
|    2023-03-11     | 01 | Dmytro Shliakhovskyi | Lab created |



<hr>

## <h3 align="center"> © IBM Corporation 2020. All rights reserved. </h3>
