## Observations and Insights 

Trends and Observations:
* The line plot of tumor volume against timepoint shows that, for the sampled mouse treated with Capomulin, tumor volume decreases as the days pass. The boxplots are consistent with this: final tumor volumes for Capomulin are low compared with Infubinol and Ceftamin.
* As the boxplot shows, Infubinol has an outlier in its final tumor volumes, so statistics such as the mean tumor volume for that regimen are affected by it.
* For the Capomulin regimen, mouse weight and average tumor volume increase together. This is confirmed by the correlation coefficient between the two, which is 0.84.
* The numbers of male and female mice in the experiment were approximately equal.


In [1]:
# Dependencies and Setup
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as st
%matplotlib widget

In [2]:
# Study data files
mouse_metadata_path = "data/Mouse_metadata.csv"
study_results_path = "data/Study_results.csv"

# Read the mouse data and the study results
mouse_metadata = pd.read_csv(mouse_metadata_path)
study_results = pd.read_csv(study_results_path)


# Combine the data into a single dataset
combined_data = pd.merge(mouse_metadata,study_results,on="Mouse ID",how="outer")

In [3]:
# Display the data table for preview
combined_data.head()

Unnamed: 0,Mouse ID,Drug Regimen,Sex,Age_months,Weight (g),Timepoint,Tumor Volume (mm3),Metastatic Sites
0,k403,Ramicane,Male,21,16,0,45.0,0
1,k403,Ramicane,Male,21,16,5,38.825898,0
2,k403,Ramicane,Male,21,16,10,35.014271,1
3,k403,Ramicane,Male,21,16,15,34.223992,1
4,k403,Ramicane,Male,21,16,20,32.997729,1


The dataset has 1893 rows and 8 columns.

In [4]:
combined_data.shape

(1893, 8)

We can check for null values with the `info` method. The output shows that none of the columns contain null values.

In [5]:
combined_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1893 entries, 0 to 1892
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Mouse ID            1893 non-null   object 
 1   Drug Regimen        1893 non-null   object 
 2   Sex                 1893 non-null   object 
 3   Age_months          1893 non-null   int64  
 4   Weight (g)          1893 non-null   int64  
 5   Timepoint           1893 non-null   int64  
 6   Tumor Volume (mm3)  1893 non-null   float64
 7   Metastatic Sites    1893 non-null   int64  
dtypes: float64(1), int64(4), object(3)
memory usage: 110.9+ KB


A total of 249 mice were used in the experiment, divided among the drug regimens.

In [6]:
# Checking the number of mice.
Total_mouse = combined_data["Mouse ID"].nunique()
print(f"Total number of mice used: {Total_mouse}")


Total number of mice used: 249


## Cleaning of DataSet
We checked whether any Mouse ID has duplicate rows for the same timepoint. Mouse ID g989 has duplicate rows for several timepoint values, so these rows need to be removed before we start analysing the data.
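The check described above relies on `DataFrame.duplicated` with a `subset` of key columns. The following is a minimal sketch on toy data (the IDs and values are hypothetical, not taken from the study) showing how the `keep` argument changes which rows are flagged.

```python
import pandas as pd

# Toy data (hypothetical IDs and values): mouse "a1" has two rows at timepoint 0
toy = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "a1", "b2"],
    "Timepoint": [0, 0, 5, 0],
    "Tumor Volume (mm3)": [45.0, 45.0, 47.1, 45.0],
})

key = ["Mouse ID", "Timepoint"]

# keep="first" (the default) flags only the later repeats of each key
later_repeats = toy.duplicated(subset=key)

# keep=False flags every row that shares its key with another row
all_repeats = toy.duplicated(subset=key, keep=False)

# IDs of the mice that have at least one duplicated (Mouse ID, Timepoint) pair
duplicate_ids = toy.loc[all_repeats, "Mouse ID"].unique().tolist()

print(later_repeats.tolist())
print(all_repeats.tolist())
print(duplicate_ids)
```

With `keep=False` both copies of a repeated pair are visible side by side, which can make it easier to decide which one to retain.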

In [7]:
# Getting the duplicate mice by ID number that shows up for Mouse ID and Timepoint. 
duplicate_rows = combined_data[combined_data.duplicated(subset=["Mouse ID","Timepoint"])]
Duplicate_rows_mouseID = duplicate_rows.iloc[0,0]
print(f"The mouse ID which has duplicate data : {Duplicate_rows_mouseID}")
duplicate_rows

The mouse ID which has duplicate data : g989


Unnamed: 0,Mouse ID,Drug Regimen,Sex,Age_months,Weight (g),Timepoint,Tumor Volume (mm3),Metastatic Sites
909,g989,Propriva,Female,21,26,0,45.0,0
911,g989,Propriva,Female,21,26,5,47.570392,0
913,g989,Propriva,Female,21,26,10,49.880528,0
915,g989,Propriva,Female,21,26,15,53.44202,0
917,g989,Propriva,Female,21,26,20,54.65765,1


In [8]:
# Optional: Get all the data for the duplicate mouse ID. 
print("All the data for the duplicate mouse ID. ")
combined_data.loc[combined_data["Mouse ID"]=="g989",:]

All the data for the duplicate mouse ID. 


Unnamed: 0,Mouse ID,Drug Regimen,Sex,Age_months,Weight (g),Timepoint,Tumor Volume (mm3),Metastatic Sites
908,g989,Propriva,Female,21,26,0,45.0,0
909,g989,Propriva,Female,21,26,0,45.0,0
910,g989,Propriva,Female,21,26,5,48.786801,0
911,g989,Propriva,Female,21,26,5,47.570392,0
912,g989,Propriva,Female,21,26,10,51.745156,0
913,g989,Propriva,Female,21,26,10,49.880528,0
914,g989,Propriva,Female,21,26,15,51.325852,1
915,g989,Propriva,Female,21,26,15,53.44202,0
916,g989,Propriva,Female,21,26,20,55.326122,1
917,g989,Propriva,Female,21,26,20,54.65765,1


We dropped the duplicate rows from the original dataset, keeping the first row for each Mouse ID and timepoint. The dataset now has 1888 rows.

In [9]:
# Create a clean DataFrame by dropping duplicate (Mouse ID, Timepoint) rows, keeping the first occurrence.
combined_data = combined_data.drop_duplicates(subset=["Mouse ID","Timepoint"],keep = "first")


In [10]:
# Checking the number of rows in each column of the clean DataFrame.
combined_data.count()

Mouse ID              1888
Drug Regimen          1888
Sex                   1888
Age_months            1888
Weight (g)            1888
Timepoint             1888
Tumor Volume (mm3)    1888
Metastatic Sites      1888
dtype: int64

## Summary Statistics
We computed summary statistics of tumor volume (mean, median, variance, standard deviation and SEM), grouped by drug regimen. Capomulin and Ramicane have the lowest mean tumor volumes, while Ketapril and Naftisol have the highest.
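As a small worked illustration of how these statistics relate, the sketch below uses a hypothetical handful of tumor volumes (not study data) to show that the SEM reported by pandas equals the sample standard deviation divided by the square root of the sample size.

```python
import numpy as np
import pandas as pd

# Hypothetical tumor volumes in mm3, for illustration only
volumes = pd.Series([45.0, 43.0, 41.0, 39.0])

variance = volumes.var()   # sample variance (ddof=1 by default)
std = volumes.std()        # sample standard deviation (ddof=1 by default)
sem = volumes.sem()        # standard error of the mean

# SEM computed by hand: std / sqrt(n)
manual_sem = std / np.sqrt(len(volumes))

print(round(variance, 2), round(std, 2), round(sem, 2), round(manual_sem, 2))
```

A smaller SEM therefore reflects either less spread in the measurements or more measurements for that regimen.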

In [11]:
# Generate a summary statistics table of mean, median, variance, standard deviation, and SEM of the tumor volume for each regimen
round(combined_data.groupby("Drug Regimen").agg({'Tumor Volume (mm3)' : ["mean",'median','var','std','sem']}),2)

Tumor Volume (mm3)
Drug Regimen,mean,median,var,std,sem
Capomulin,40.68,41.56,24.95,4.99,0.33
Ceftamin,52.59,51.78,39.29,6.27,0.47
Infubinol,52.88,51.82,43.13,6.57,0.49
Ketapril,55.24,53.7,68.55,8.28,0.6
Naftisol,54.33,52.51,66.17,8.13,0.6
Placebo,54.03,52.29,61.17,7.82,0.58
Propriva,52.39,50.91,43.14,6.57,0.53
Ramicane,40.22,40.67,23.49,4.85,0.32
Stelasyn,54.23,52.43,59.45,7.71,0.57
Zoniferol,53.24,51.82,48.53,6.97,0.52


## Bar and Pie Charts

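The bar chart of drug regimen against the plotted count shows that Capomulin and Ramicane have the highest values. Note that the count is computed with `count()` on the "Mouse ID" column, so it is the number of recorded observations (one row per mouse per timepoint) in each regimen.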
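To make the distinction between rows and mice concrete, here is a minimal sketch on toy data (regimen names and IDs are hypothetical) comparing `count()`, which counts rows, with `nunique()`, which counts distinct mice.

```python
import pandas as pd

# Toy data: DrugA has 2 mice over 4 rows, DrugB has 1 mouse over 2 rows
toy = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "a1", "b2", "c3", "c3"],
    "Drug Regimen": ["DrugA", "DrugA", "DrugA", "DrugA", "DrugB", "DrugB"],
})

grouped = toy.groupby("Drug Regimen")["Mouse ID"]

rows_per_regimen = grouped.count()     # number of observations (rows)
mice_per_regimen = grouped.nunique()   # number of distinct mice

print(rows_per_regimen.to_dict())
print(mice_per_regimen.to_dict())
```

Either quantity can be plotted; the axis label should say which one it is.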

In [12]:
# Count the unique mice in each regimen and collect the regimen names
# (groupby returns the regimens in alphabetical order)
unique_mice = combined_data.groupby("Drug Regimen")["Mouse ID"].nunique()
mouse_count = unique_mice.tolist()
drug_regime = unique_mice.index.tolist()
drug_regime

['Capomulin',
 'Ceftamin',
 'Infubinol',
 'Ketapril',
 'Naftisol',
 'Placebo',
 'Propriva',
 'Ramicane',
 'Stelasyn',
 'Zoniferol']

In [13]:
# Generate a bar plot showing the total number of mice for each treatment throughout the course of the study using pandas. 
mouseID_per_regime = pd.DataFrame(combined_data.groupby("Drug Regimen")["Mouse ID"].count())
mouseID_per_regime

mice_count_per_treatment = pd.DataFrame({
    "Drug Regime" : drug_regime,
    "Number of mice" : mouseID_per_regime["Mouse ID"]
})
# mice_count_per_treatment
mice_count_per_treatment.plot(kind="bar",x="Drug Regime",y="Number of mice")
plt.tight_layout()


In [14]:
# Generate a bar plot showing the total number of mice for each treatment throughout the course of the study using pyplot.
x_axis = drug_regime
users = mouseID_per_regime["Mouse ID"]
fig9, ax9 = plt.subplots()
ax9.bar(x_axis, users, color='red', alpha=0.5, align="center")
plt.title("Total number of Mice for each Treatment")
plt.xlabel("Drugs")
plt.ylabel("Number of Mice")
plt.xticks(rotation=90)
plt.tight_layout()


The pie chart of male and female counts shows that the difference between the two is almost negligible: approximately the same number of male and female mice were used in the experiment.

In [15]:
# Generate a pie plot showing the distribution of female versus male mice using pandas
group_by_gender = pd.DataFrame(combined_data.groupby("Sex")["Sex"].count())
group_by_gender.plot(kind="pie",y="Sex")
plt.title("Number of female & Male Mice")
#group_by_gender


Text(0.5, 1.0, 'Number of female & Male Mice')

In [16]:

group_by_gender

labels = ["Female", "Male"]
sizes = list(group_by_gender["Sex"])
colors = ["pink","lightblue"]
explode = (0.1, 0)
fig8, ax8 = plt.subplots()
ax8.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct="%1.1f%%", shadow=True, startangle=140);
plt.title("Number of female & Male Mice")


Text(0.5, 1.0, 'Number of female & Male Mice')

## Quartiles, Outliers and Boxplots

We calculated the final tumor volume of each mouse across four of the treatment regimens: Capomulin, Ramicane, Infubinol, and Ceftamin.
* Find the greatest timepoint for each mouse.
* Create a new column named "Maximum Timepoint" by merging the greatest-timepoint data into the original dataframe.
* Create an empty dictionary to fill with tumor volume data.
* Loop through the combined data to find the tumor volume at the highest timepoint for each of the listed treatments.
* Find the quartiles of the tumor volume for each drug.
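The steps above can also be sketched without an explicit row loop. The following is an alternative, not the approach used in this notebook: it assumes toy data with hypothetical values and uses `idxmax` to pick the row at each mouse's last timepoint.

```python
import pandas as pd

# Toy data (hypothetical values): two mice with different numbers of timepoints
toy = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "a1", "b2", "b2"],
    "Drug Regimen": ["Capomulin"] * 3 + ["Infubinol"] * 2,
    "Timepoint": [0, 5, 10, 0, 5],
    "Tumor Volume (mm3)": [45.0, 43.2, 40.1, 45.0, 47.9],
})

# Index label of the row holding the greatest timepoint for each mouse
last_index = toy.groupby("Mouse ID")["Timepoint"].idxmax()

# Select those rows: one row per mouse, at its final timepoint
last_rows = toy.loc[last_index]

# Collect the final tumor volumes per regimen as plain lists
final_volumes = {
    drug: group["Tumor Volume (mm3)"].tolist()
    for drug, group in last_rows.groupby("Drug Regimen")
}

print(final_volumes)
```

Keeping the result as a dictionary of lists avoids assuming that every regimen has the same number of mice.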

In [17]:
# Start by getting the last (greatest) timepoint for each mouse
group_by_mouse = combined_data.groupby("Mouse ID")
last_timepoint = pd.DataFrame(group_by_mouse["Timepoint"].max())
last_timepoint

# Merge this group df with the original dataframe to get the tumor volume at the last timepoint.
# Both frames have a Timepoint column, so the merge names them Timepoint_x (original)
# and Timepoint_y (maximum per mouse).
combined_data = pd.merge(combined_data,last_timepoint,on="Mouse ID",how="outer")
combined_data = combined_data.rename(columns={"Timepoint_y" : "Maximum Timepoint"})

In [18]:
# Put treatments into a list for for loop (and later for plot labels)
treatments = ["Capomulin", "Ramicane", "Infubinol", "Ceftamin"]

# Create empty Dictionary to fill with tumor vol data (for plotting)
tumor_vol_data = {
    "Capomulin" :[] ,
    "Ramicane" : [],
    "Infubinol" : [],
    "Ceftamin" : []
}

# Loop through the combined data to find the tumor volume at the highest timepoint in 
# each of the listed treatments.
for index, row in combined_data.iterrows():
    if row["Maximum Timepoint"] == row['Timepoint_x'] and row["Drug Regimen"] in treatments:
        tumor_vol_data[row["Drug Regimen"]].append(row["Tumor Volume (mm3)"])

# Finding the quartiles of the tumor volume of each drug.
tumor_vol_data = pd.DataFrame(tumor_vol_data)
quartiles_data = pd.DataFrame(tumor_vol_data.quantile([0.25,0.5,0.75]))
print("The quartiles of final tumor volume for each Drug Regime are:")
round(quartiles_data,2)

The quartiles of final tumor volume for each Drug Regime are:


Unnamed: 0,Capomulin,Ramicane,Infubinol,Ceftamin
0.25,32.38,31.56,54.05,48.72
0.5,38.13,36.56,60.17,59.85
0.75,40.16,40.66,65.53,64.3


* The middle 50% of final tumor volumes in the Capomulin regimen lie between 32.38 and 40.16.
* In the Infubinol regimen, the quartiles of final tumor volume are considerably higher: the first and third quartiles are 54.05 and 65.53.
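Quartiles like these feed the usual 1.5 × IQR rule for flagging potential outliers. The sketch below is a minimal illustration on hypothetical values (not study data); the helper name `iqr_bounds` is an assumption introduced here.

```python
import numpy as np

def iqr_bounds(values):
    """Return (lower, upper) fences using the 1.5 * IQR rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical final tumor volumes in mm3, with one unusually low value
volumes = [36.0, 54.0, 56.0, 58.0, 60.0, 62.0, 64.0, 66.0, 68.0]

lower, upper = iqr_bounds(volumes)

# Any value outside the fences is a potential outlier
outliers = [v for v in volumes if v < lower or v > upper]

print(lower, upper, outliers)
```

`np.percentile` uses linear interpolation by default, which matches the default of `DataFrame.quantile`.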

In [19]:
# Finding the IQR for each Drug Regime

quartiles_data = quartiles_data.set_index(pd.Index([1, 2, 3]))
quartiles_data
IQR_data = pd.DataFrame({
    "Capomulin" : [quartiles_data.loc[3,"Capomulin"] - quartiles_data.loc[1,"Capomulin"]],
    "Ramicane" : quartiles_data.loc[3,"Ramicane"] - quartiles_data.loc[1,"Ramicane"],
    "Infubinol" : quartiles_data.loc[3,"Infubinol"] - quartiles_data.loc[1,"Infubinol"],
    "Ceftamin" : quartiles_data.loc[3,"Ceftamin"] - quartiles_data.loc[1,"Ceftamin"]
})
print("IQR for each Drug Regime is:")
round(IQR_data,2)

IQR for each Drug Regime is:


Unnamed: 0,Capomulin,Ramicane,Infubinol,Ceftamin
0,7.78,9.1,11.48,15.58


In [20]:
# Calculate the upper bound and Lower Bound for each drug.
lower_bound = []
upper_bound = []
i = 0 
for index, col in quartiles_data.items():  # .items() replaces iteritems(), which was removed in pandas 2.0
    lower_bound.append(col[1]-(1.5* IQR_data.iloc[0,i]))
    upper_bound.append(col[3] + 1.5 * IQR_data.iloc[0,i])
    print(f"Lower bound of {treatments[i]} is:{round(lower_bound[-1],2)}")
    print(f"Upper bound of {treatments[i]} is:{round(upper_bound[-1],2)}")
    i = i + 1

Lower bound of Capomulin is:20.7
Upper bound of Capomulin is:51.83
Lower bound of Ramicane is:17.91
Upper bound of Ramicane is:54.31
Lower bound of Infubinol is:36.83
Upper bound of Infubinol is:82.74
Lower bound of Ceftamin is:25.36
Upper bound of Ceftamin is:87.67


In [21]:
# Check if any values are outside the lower bound and upper bound of the mentioned drug regime
tumor_vol_data
for index,row in tumor_vol_data.iterrows():
    if (row["Capomulin"] < lower_bound[0] or  row["Capomulin"] > upper_bound[0]):
        print(f"Outlier in Capomulin: {row['Capomulin']}")
    if (row["Ramicane"] < lower_bound[1] or  row["Ramicane"] > upper_bound[1]):
        print(f"Outlier in Ramicane: {row['Ramicane']}")
    if (row["Infubinol"] < lower_bound[2] or row["Infubinol"] > upper_bound[2]):
        print(f"Outlier in Infubinol: {row['Infubinol']}")
    if (row["Ceftamin"] < lower_bound[3] or  row["Ceftamin"] > upper_bound[3]):
        print(f"Outlier in Ceftamin: {row['Ceftamin']}")
    

Outlier in Infubinol: 36.321345799999996


In [22]:
# Generate a box plot of the final tumor volume of each mouse across four regimens of interest
# One series of final tumor volumes per regimen, in the same order as `treatments`
data = [tumor_vol_data[drug] for drug in treatments]
fig1, ax1 = plt.subplots()
ax1.set_title('Final Tumor Volume by Drug Regimen')
ax1.boxplot(data, sym='r')  # draw outlier points in red
ax1.set_xticklabels(treatments)
ax1.set_ylabel('Final Tumor Volume (mm3)')

plt.show()
    


## Line and Scatter Plots

In [23]:
# Generate a line plot of time point versus tumor volume for a mouse treated with Capomulin
Capomulin_mouse= pd.DataFrame(combined_data.loc[combined_data["Drug Regimen"] == "Capomulin",:])
sample_mouse = Capomulin_mouse.iloc[0,0]

Line_plot_data = pd.DataFrame(combined_data.loc[combined_data["Mouse ID"] == sample_mouse,["Mouse ID","Timepoint_x","Tumor Volume (mm3)"]])
#sample_mouse
x_values = Line_plot_data['Timepoint_x']
y_values = Line_plot_data['Tumor Volume (mm3)']

fig2, ax2 = plt.subplots()
ax2.set_title(f'Timepoint and Tumor Volume(mm3) for mouse ID {sample_mouse}')
ax2.plot(x_values,y_values)
ax2.grid()
plt.xlabel('Timepoint')
plt.ylabel('Tumor Volume (mm3)')
plt.show()



In [24]:
# Generate a scatter plot of mouse weight versus average tumor volume for the Capomulin regimen
Avg_tumor_data = pd.DataFrame(Capomulin_mouse.groupby("Mouse ID")["Tumor Volume (mm3)"].mean())
# Each mouse has a single recorded weight, so take the first value per mouse
Weight_data = pd.DataFrame(Capomulin_mouse.groupby("Mouse ID")["Weight (g)"].first())


x_values = Weight_data['Weight (g)']
y_values = Avg_tumor_data['Tumor Volume (mm3)']
fig3, ax3 = plt.subplots()
ax3.set_title('Weight and Tumor Volume(mm3)')
ax3.scatter(x_values,y_values)
plt.xlabel('Weight')
plt.ylabel('Tumor Volume (mm3)')
plt.show()


## Correlation and Regression

In [25]:
# Calculate the correlation coefficient and linear regression model 
# for mouse weight and average tumor volume for the Capomulin regimen
correlation = st.pearsonr(x_values,y_values)
print(f"The correlation between both factors is {round(correlation[0],2)}")

The correlation between both factors is 0.84


In [26]:
(slope, intercept, rvalue, pvalue, stderr) = st.linregress(x_values, y_values)
regress_values = x_values * slope + intercept
line_eq = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))
ax3.plot(x_values,regress_values,"r-")
ax3.annotate(line_eq,xy=(20, 30),fontsize=15,color="red")
print(f"The r-squared is: {rvalue**2}")
print(line_eq)
plt.show()

The r-squared is: 0.7088568047708715
y = 0.95x + 21.55
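The r-squared value above is the square of the Pearson correlation coefficient, which is why a correlation of about 0.84 corresponds to an r-squared of about 0.71. The following is a minimal sketch on hypothetical weights and volumes (not study data) checking that relationship with `scipy.stats`.

```python
import scipy.stats as st

# Hypothetical mouse weights (g) and average tumor volumes (mm3)
weights = [15, 17, 19, 21, 23, 25]
volumes = [36.2, 37.9, 38.1, 41.0, 42.3, 44.8]

# Pearson correlation coefficient and its p-value
r, p_value = st.pearsonr(weights, volumes)

# Least-squares line: volume = slope * weight + intercept
result = st.linregress(weights, volumes)

# Fitted values along the regression line
predicted = [result.slope * w + result.intercept for w in weights]

print(f"r = {r:.3f}, r-squared = {result.rvalue ** 2:.3f}")
print(f"y = {result.slope:.2f}x + {result.intercept:.2f}")
```

For a simple linear regression with one predictor, `linregress` reports the same `rvalue` that `pearsonr` returns.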
