#  Observations and Inferences:
#### Instructions: Look across all previously generated figures and tables and write at least three observations or inferences  that can be made from the data.  

> 1. In the data samples printed for the various analyses and graphs, each mouse's weight stays the same across all of its Timepoint measurements, and every Mouse Weight is recorded to the nearest gram.  One plausible explanation is that the mice are weighed only once, perhaps at the beginning of the experiment, and that the original weight is carried forward.  That is why the scatterplot looks like a series of vertical lines of points, with all of the mouse weights at integer values (whole numbers) in grams.

> 2. Another curiosity is demonstrated by the box plots of the four Drug Regimens.  Capomulin and Ramicane have comparable boxplots, as do Infubinol and Ceftamin.  Since a smaller final tumor volume is better, one may infer that Capomulin and Ramicane are more effective drug regimens for cancer than Infubinol and Ceftamin.

> 3. The other potentially remarkable thing about the four boxplots is that *none* of them show any outliers.  When I went back and examined the arrays of the final tumor volumes against the quartiles for each of the drug regimens, none of the drug regimens had any tumor volumes which would be considered outliers based on the analysis. This may not be the result for other drugs in the study. 

> 4. In the graph with the line produced by linear regression analysis, a larger mouse weight tends to go with a larger Tumor Volume.  It should be noted that this is a correlation and not necessarily causation; more analysis would need to be done.
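
The first observation can be checked directly instead of by eye. The sketch below uses a small made-up DataFrame (`toy_df` and its values are hypothetical stand-ins for the merged study data) to test whether each mouse has a single recorded weight and whether all weights are whole grams.

```python
import pandas as pd

# Hypothetical stand-in for the merged study data (values are made up)
toy_df = pd.DataFrame({
    'Mouse ID': ['a1', 'a1', 'a1', 'b2', 'b2'],
    'Timepoint': [0, 5, 10, 0, 5],
    'Weight (g)': [22, 22, 22, 17, 17],
})

# Number of distinct weights recorded for each mouse
weights_per_mouse = toy_df.groupby('Mouse ID')['Weight (g)'].nunique()

# True when every mouse has exactly one recorded weight
weight_is_constant = bool((weights_per_mouse == 1).all())

# True when every weight is a whole number of grams
weight_is_integer = bool((toy_df['Weight (g)'] % 1 == 0).all())

print(weight_is_constant, weight_is_integer)
```

Running the same two checks against `new_df` would confirm or refute the observation for the full dataset.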

#### Other development notes:
> I made this exercise too hard initially by trying to do the statistical analysis for each Drug Regimen at the Tumor Volume for each Timepoint.  I left in the drug_groups_describe DataFrame to demonstrate that it was possible to do this, even if it was unnecessary to do so.

> The Line Plot of Tumor Volume vs. Timepoint for a single mouse in the Capomulin trial will change each time that series of cells is run, because I am randomly selecting a mouse out of the Capomulin group to build the graph.  Happily, every time I tested the graph, the line had a similar downward slope, indicating that the Tumor Volume decreased for each of the mice I happened to select.


In [None]:
# Dependencies
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import sem
from scipy.stats import linregress
from random import randint

In [None]:
# Read CSV
study_results_df = pd.read_csv('data/Study_results.csv')
study_results_df


In [None]:
mouse_meta_df = pd.read_csv('data/Mouse_metadata.csv')
mouse_meta_df


In [None]:
# check for mouse ID with duplicate time points ... remove any data associated with that mouse ID
study_results_df.duplicated(subset=['Timepoint', 'Mouse ID']).sum()

dupes = study_results_df[study_results_df.duplicated(subset=['Mouse ID','Timepoint'])]
dupes
#dupes['Mouse ID'].unique()
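
By default `duplicated` flags only the second and later copies of a repeated row, so the first copy never shows up in `dupes`. A minimal sketch, on made-up data (`toy_results` is hypothetical), of using `keep=False` to see every row involved in a duplicated (Mouse ID, Timepoint) pair:

```python
import pandas as pd

# Hypothetical measurements; mouse 'g9' has two rows for Timepoint 0
toy_results = pd.DataFrame({
    'Mouse ID': ['a1', 'a1', 'g9', 'g9', 'g9'],
    'Timepoint': [0, 5, 0, 0, 5],
    'Tumor Volume (mm3)': [45.0, 44.1, 45.0, 45.0, 47.2],
})

# keep=False marks every row of a duplicated pair, not just the repeats
dupe_mask = toy_results.duplicated(subset=['Mouse ID', 'Timepoint'], keep=False)
all_dupe_rows = toy_results[dupe_mask]

# Number of rows involved in duplicates, per mouse
dupe_rows_per_mouse = all_dupe_rows['Mouse ID'].value_counts()
print(all_dupe_rows)
```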

In [None]:
# Dom's suggestion at office hours
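# Drop every row for any mouse that has duplicated timepoints (looked up by column name, not position)
clean_study_results_df = study_results_df.loc[~study_results_df['Mouse ID'].isin(dupes['Mouse ID'].unique())]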
clean_study_results_df

In [None]:
# Join dataframes for next
#mouse_meta_df.join(study_results_df,on='Mouse ID', how="inner")
new_df = clean_study_results_df.merge(mouse_meta_df, how='inner', on='Mouse ID')
new_df
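
The merge relies on the metadata having exactly one row per mouse. A hedged sketch of how that assumption could be checked at merge time, using the `validate` and `indicator` parameters of `DataFrame.merge` on made-up frames (`toy_results` and `toy_meta` are hypothetical):

```python
import pandas as pd

# Hypothetical measurement and metadata tables
toy_results = pd.DataFrame({
    'Mouse ID': ['a1', 'a1', 'b2'],
    'Timepoint': [0, 5, 0],
    'Tumor Volume (mm3)': [45.0, 44.1, 45.0],
})
toy_meta = pd.DataFrame({
    'Mouse ID': ['a1', 'b2'],
    'Drug Regimen': ['Capomulin', 'Ceftamin'],
    'Weight (g)': [22, 28],
})

# validate raises pandas.errors.MergeError if a Mouse ID repeats in toy_meta;
# indicator adds a '_merge' column saying where each row came from
toy_merged = toy_results.merge(toy_meta, on='Mouse ID', how='left',
                               validate='many_to_one', indicator=True)

# Rows tagged 'left_only' would be measurements with no matching metadata
unmatched = toy_merged[toy_merged['_merge'] == 'left_only']
print(len(toy_merged), len(unmatched))
```

A left merge is used here only so that unmatched measurements stay visible; the inner merge above would silently drop them.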

In [None]:
# first attempt at a summary statistics table, SEE BELOW 
drug_groups_describe = pd.DataFrame(new_df.groupby(['Drug Regimen','Timepoint'])['Tumor Volume (mm3)'].describe())

drug_groups_describe

In [None]:
# Generate a summary statistics table consisting of mean, median, variance, standard deviation, and SEM 
# of the tumor volume for each drug regimen

# agg computes all five statistics in one pass and names each column after its statistic
summary_stats = ['mean', 'median', 'var', 'std', 'sem']
drug_groups_df = new_df.groupby('Drug Regimen')['Tumor Volume (mm3)'].agg(summary_stats)
drug_groups_df

In [None]:
# Generate a bar plot that shows the total number of measurements taken for each treatment regimen thru-out the study
## using Panda's DataFrame.plot

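measurement_count = new_df.groupby('Drug Regimen')['Mouse ID'].count()

# Take the labels from the grouped index so each regimen lines up with its own count
# (unique() returns order of first appearance, while groupby sorts alphabetically)
drug_list = measurement_count.index

df = measurement_count.rename('Measurement Count').reset_index()
ax = df.plot.bar(x='Drug Regimen', y='Measurement Count')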
plt.show()

In [None]:
## using Matplotlib 'pyplot'

x_axis = np.arange(len(measurement_count))

plt.bar(x_axis, measurement_count, color='blue', alpha=0.5, align="center")

# Label each bar from the grouped index so the names match the counts
plt.xticks(x_axis, measurement_count.index, rotation="vertical")
plt.title("Total Measurements Taken for Each Treatment")
plt.xlabel("Drug Regimens")
plt.ylabel("Measurement Count")
plt.show()

In [None]:
# Generate a pie plot that shows the distribution of female or male mice in the study
## using Pandas's DataFrame.plot()

new_df.groupby('Sex')['Mouse ID'].count().plot(kind='pie', y='Sex', shadow=True,  startangle=120, autopct='%1.1f%%')

plt.show()

In [None]:
## using Matplotlib's 'pyplot'
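# Count measurements per sex from the data instead of hardcoding the totals
sex_count = new_df.groupby('Sex')['Mouse ID'].count()
labels = [f"{sex}s" for sex in sex_count.index]   # e.g. "Female" -> "Females"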
explode = (0.1,0)
colors = ["lightcoral","lightskyblue"]
plt.pie(sex_count, explode=explode, labels=labels, colors=colors, autopct="%1.1f%%", startangle=90)
plt.axis("equal")
plt.show()
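
Both pie charts count measurement rows, so a mouse with more timepoints weighs more heavily than one with fewer. If the intent is the split of mice rather than of measurements, counting unique Mouse IDs is one option. A small sketch on made-up data (`toy_df` is hypothetical):

```python
import pandas as pd

# Hypothetical data: one female mouse with three rows, two male mice with three rows in total
toy_df = pd.DataFrame({
    'Mouse ID': ['a1', 'a1', 'a1', 'b2', 'c3', 'c3'],
    'Sex': ['Female', 'Female', 'Female', 'Male', 'Male', 'Male'],
})

# Rows per sex: what the cells above plot
rows_per_sex = toy_df.groupby('Sex')['Mouse ID'].count()

# Distinct mice per sex: each mouse counted once
mice_per_sex = toy_df.groupby('Sex')['Mouse ID'].nunique()

print(rows_per_sex.to_dict(), mice_per_sex.to_dict())
```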

In [None]:
# Calculate the final tumor volume of each mouse across four of the most promising treatment regimens:
## Capomulin, Ramicane, Infubinol, and Ceftamin

# Mouse ID, Timepoint =max?, Tumor volume, Drug Regimen
final_tumor_df = new_df[['Drug Regimen', 'Mouse ID', 'Timepoint', 'Tumor Volume (mm3)']]

final_tumor_cap = final_tumor_df.loc[final_tumor_df['Drug Regimen'] == 'Capomulin']
final_tumor_cap = final_tumor_cap.loc[final_tumor_cap['Timepoint'] == final_tumor_cap['Timepoint'].max()]

final_tumor_cap

In [None]:
final_tumor_ram = final_tumor_df.loc[final_tumor_df['Drug Regimen'] == 'Ramicane']
final_tumor_ram = final_tumor_ram.loc[final_tumor_ram['Timepoint'] == final_tumor_ram['Timepoint'].max()]

final_tumor_ram

In [None]:
final_tumor_inf = final_tumor_df.loc[final_tumor_df['Drug Regimen'] == 'Infubinol']
final_tumor_inf = final_tumor_inf.loc[final_tumor_inf['Timepoint'] == final_tumor_inf['Timepoint'].max()]

final_tumor_inf

In [None]:
final_tumor_cef = final_tumor_df.loc[final_tumor_df['Drug Regimen'] == 'Ceftamin']
final_tumor_cef = final_tumor_cef.loc[final_tumor_cef['Timepoint'] == final_tumor_cef['Timepoint'].max()]

final_tumor_cef
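
The filters above keep only the rows at each regimen's maximum Timepoint, so a mouse whose measurements stop earlier is left out entirely. Whether that is the intended meaning of "final tumor volume" is a judgment call. The sketch below, on made-up data (`toy_df` is hypothetical), shows an alternative that takes each mouse's own last measurement; on the real data this could change which values fall outside the outlier bounds.

```python
import pandas as pd

# Hypothetical data: mouse 'c3' has no measurements after Timepoint 5
toy_df = pd.DataFrame({
    'Drug Regimen': ['Infubinol'] * 5,
    'Mouse ID': ['a1', 'a1', 'a1', 'c3', 'c3'],
    'Timepoint': [0, 5, 45, 0, 5],
    'Tumor Volume (mm3)': [45.0, 47.0, 67.9, 45.0, 36.3],
})

# Row label of each mouse's own latest timepoint
last_rows = toy_df.groupby('Mouse ID')['Timepoint'].idxmax()
final_per_mouse = toy_df.loc[last_rows]

# Filter used in the cells above: only mice measured at the regimen-wide maximum
final_at_max = toy_df.loc[toy_df['Timepoint'] == toy_df['Timepoint'].max()]

print(len(final_per_mouse), len(final_at_max))
```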

In [None]:
# Calculate the quartiles and IQR and quantitatively determine if there are any potential outliers across 
# all four treatment regimens.
cap_tumors = np.asarray(final_tumor_cap['Tumor Volume (mm3)'])
cap_tumors

In [None]:
# Calculate the quartiles and IQR 
cap_quartiles = pd.DataFrame(cap_tumors).quantile([.25,.5,.75], axis=0)
cap_quartiles

In [None]:
cap_lowerq = cap_quartiles.loc[0.25]  #TODO: this reference needs to be cleaned up
cap_median = cap_quartiles.loc[0.50]
cap_upperq = cap_quartiles.loc[0.75]
cap_iqr = cap_upperq - cap_lowerq

print(f'The lower quartile of tumor volume for Capomulin is: {cap_lowerq[0]}')
print(f'The upper quartile of tumor volume for Capomulin is: {cap_upperq[0]}')
print(f'The interquartile range of tumor volume for Capomulin is: {cap_iqr[0]}')
print(f'The median of tumor volume for Capomulin is: {cap_median[0]}')

cap_lower_bound = cap_lowerq - (1.5*cap_iqr)
cap_upper_bound = cap_upperq + (1.5*cap_iqr)

print(f'Capomulin tumor volume below {cap_lower_bound[0]} could be outliers.')
print(f'Capomulin tumor volume above {cap_upper_bound[0]} could be outliers.')

In [None]:
ram_tumors = np.asarray(final_tumor_ram['Tumor Volume (mm3)'])
ram_tumors

In [None]:
# Calculate the quartiles and IQR
ram_quartiles = pd.DataFrame(ram_tumors).quantile([.25,.5,.75], axis=0)
ram_quartiles

In [None]:
ram_lowerq = ram_quartiles.loc[0.25]  #TODO: this reference needs to be cleaned up
ram_median = ram_quartiles.loc[0.50]
ram_upperq = ram_quartiles.loc[0.75]
ram_iqr = ram_upperq - ram_lowerq

print(f'The lower quartile of tumor volume for Ramicane is: {ram_lowerq[0]}')
print(f'The upper quartile of tumor volume for Ramicane is: {ram_upperq[0]}')
print(f'The interquartile range of tumor volume for Ramicane is: {ram_iqr[0]}')
print(f'The median of tumor volume for Ramicane is: {ram_median[0]}')

ram_lower_bound = ram_lowerq - (1.5*ram_iqr)
ram_upper_bound = ram_upperq + (1.5*ram_iqr)

print(f'Ramicane tumor volume below {ram_lower_bound[0]} could be outliers.')
print(f'Ramicane tumor volume above {ram_upper_bound[0]} could be outliers.')

In [None]:
# Infubinol
inf_tumors = np.asarray(final_tumor_inf['Tumor Volume (mm3)'])
inf_tumors

In [None]:
inf_quartiles = pd.DataFrame(inf_tumors).quantile([.25,.5,.75], axis=0)
inf_quartiles

In [None]:
inf_lowerq = inf_quartiles.loc[0.25]  #TODO: this reference needs to be cleaned up
inf_median = inf_quartiles.loc[0.50]
inf_upperq = inf_quartiles.loc[0.75]
inf_iqr = inf_upperq - inf_lowerq

print(f'The lower quartile of tumor volume for Infubinol is: {inf_lowerq[0]}')
print(f'The upper quartile of tumor volume for Infubinol is: {inf_upperq[0]}')
print(f'The interquartile range of tumor volume for Infubinol is: {inf_iqr[0]}')
print(f'The median of tumor volume for Infubinol is: {inf_median[0]}')

inf_lower_bound = inf_lowerq - (1.5*inf_iqr)
inf_upper_bound = inf_upperq + (1.5*inf_iqr)

print(f'Infubinol tumor volume below {inf_lower_bound[0]} could be outliers.')
print(f'Infubinol tumor volume above {inf_upper_bound[0]} could be outliers.')

In [None]:
# Ceftamin
cef_tumors = np.asarray(final_tumor_cef['Tumor Volume (mm3)'])
cef_tumors

In [None]:
cef_quartiles = pd.DataFrame(cef_tumors).quantile([.25,.5,.75], axis=0)
cef_quartiles

In [None]:
cef_lowerq = cef_quartiles.loc[0.25]  #TODO: this reference needs to be cleaned up
cef_median = cef_quartiles.loc[0.50]
cef_upperq = cef_quartiles.loc[0.75]
cef_iqr = cef_upperq - cef_lowerq

print(f'The lower quartile of tumor volume for Ceftamin is: {cef_lowerq[0]}')
print(f'The upper quartile of tumor volume for Ceftamin is: {cef_upperq[0]}')
print(f'The interquartile range of tumor volume for Ceftamin is: {cef_iqr[0]}')
print(f'The median of tumor volume for Ceftamin is: {cef_median[0]}')

cef_lower_bound = cef_lowerq - (1.5*cef_iqr)
cef_upper_bound = cef_upperq + (1.5*cef_iqr)

print(f'Ceftamin tumor volume below {cef_lower_bound[0]} could be outliers.')
print(f'Ceftamin tumor volume above {cef_upper_bound[0]} could be outliers.')
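
The four quartile blocks above repeat the same arithmetic. One way to address the TODO about the `[0]` references is to take quantiles of a Series, which returns plain numbers, and wrap the calculation in a function. The sketch below is only an illustration: `iqr_bounds` and the `toy_volumes` values are hypothetical, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def iqr_bounds(values):
    """Return (lower quartile, median, upper quartile, lower bound, upper bound)."""
    q1, med, q3 = pd.Series(values).quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    return q1, med, q3, q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical final tumor volumes for two made-up regimens
toy_volumes = {
    'Drug A': np.array([30.0, 32.0, 34.0, 36.0, 38.0]),
    'Drug B': np.array([60.0, 62.0, 64.0, 66.0, 20.0]),
}

outliers_found = {}
for drug, volumes in toy_volumes.items():
    q1, med, q3, lower, upper = iqr_bounds(volumes)
    # Values outside the 1.5 * IQR fences are flagged as potential outliers
    outliers_found[drug] = volumes[(volumes < lower) | (volumes > upper)]
    print(f'{drug}: bounds ({lower}, {upper}), potential outliers: {outliers_found[drug]}')
```

Passing `cap_tumors`, `ram_tumors`, `inf_tumors`, and `cef_tumors` through the same loop would list any flagged values explicitly instead of leaving the comparison to be done by eye.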

In [None]:
# Capomulin, Ramicane, Infubinol, and Ceftamin

columns = [cap_tumors, ram_tumors, inf_tumors, cef_tumors]

fig, ax = plt.subplots()
ax.set_title('Drug Regimens')
ax.set_ylabel('Tumor Volume (mm3)')
labels = ["Capomulin", "Ramicane", "Infubinol", "Ceftamin"]
ax.boxplot(columns, labels=labels)
plt.show()

In [None]:
# Select a mouse that was treated with Capomulin 
# Drug Regimen, Mouse ID, Timepoint, Tumor volume, Weight
capomulin_df = new_df[['Drug Regimen', 'Mouse ID', 'Timepoint', 'Tumor Volume (mm3)', 
                        'Weight (g)']]

capomulin_df = capomulin_df.loc[capomulin_df['Drug Regimen'] == 'Capomulin']
capomulin_df

In [None]:
mice = capomulin_df['Mouse ID']
mice

In [None]:
mouse_arr = mice.unique()
mouse_arr

In [None]:
x = randint(0, len(mouse_arr) - 1)    # randint includes both endpoints, so this is always a valid index
print(f'{x} and mouse id:{mouse_arr[x]}.')

mouse = capomulin_df.loc[capomulin_df['Mouse ID'] == mouse_arr[x]]
mouse
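
Because the mouse is picked at random, the line plot differs from run to run. If a repeatable figure is wanted, a seeded generator is one option. A minimal sketch, assuming a made-up list of IDs (`toy_mouse_ids`) and an arbitrary seed:

```python
import random

# Hypothetical IDs standing in for mouse_arr
toy_mouse_ids = ['a1', 'b2', 'c3', 'd4']

# A dedicated generator with a fixed seed makes the same pick on every run
rng = random.Random(42)
picked_id = rng.choice(toy_mouse_ids)

# Re-creating the generator with the same seed reproduces the pick
repeat_id = random.Random(42).choice(toy_mouse_ids)

print(picked_id == repeat_id)
```

Dropping the seed restores the original behaviour of a different mouse on each run.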

In [None]:
# generate a line plot of tumor volume vs. time point for that mouse

plt.plot(mouse['Timepoint'], mouse['Tumor Volume (mm3)'])
plt.xlabel("Timepoint")
plt.ylabel("Tumor Volume (mm3)")
plt.title(f'Tumor vs Time for Mouse ID {mouse_arr[x]}')
plt.show()

In [None]:
# Generate a scatter plot of tumor volume versus mouse weight for the Capomulin treatment regimen
x_values = capomulin_df['Weight (g)']
y_values = capomulin_df['Tumor Volume (mm3)']

# Calculate the correlation coefficient (rvalue) and linear regression model between mouse weight
# and tumor volume for the Capomulin treatment.  Every timepoint measurement is used here,
# not the average tumor volume per mouse.

(slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)
regress_values = x_values * slope + intercept
line_eq = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))

# Plot the linear regression model on top of the previous scatter plot
plt.scatter( x_values, y_values, marker="o", facecolor='blue', alpha=0.75)
plt.plot(x_values, regress_values, "r-")
plt.annotate(line_eq, (18,25), fontsize=14, color='red')
plt.ylabel("Tumor Volume (mm3)")
plt.xlabel("Weight (g)")
plt.title('Capomulin: Tumor vs Mouse Weight')
plt.show()
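
The regression above uses one point per measurement, which is why each mouse appears as a vertical stack of points. If the goal is weight versus average tumor volume, the data can be reduced to one row per mouse first. A sketch on made-up data (`toy_cap` and its values are hypothetical), which also reads the correlation coefficient from the `rvalue` attribute that `linregress` returns:

```python
import pandas as pd
from scipy.stats import linregress

# Hypothetical Capomulin measurements: two timepoints for each of three mice
toy_cap = pd.DataFrame({
    'Mouse ID': ['a1', 'a1', 'b2', 'b2', 'c3', 'c3'],
    'Weight (g)': [17, 17, 20, 20, 23, 23],
    'Tumor Volume (mm3)': [45.0, 35.0, 45.0, 41.0, 45.0, 47.0],
})

# One row per mouse: its weight and its mean tumor volume
per_mouse = toy_cap.groupby('Mouse ID').agg(
    weight=('Weight (g)', 'first'),
    avg_volume=('Tumor Volume (mm3)', 'mean'),
)

# Fit the line on the per-mouse averages; rvalue is the correlation coefficient
fit = linregress(per_mouse['weight'], per_mouse['avg_volume'])
print(f'r = {fit.rvalue:.2f}, y = {fit.slope:.2f}x + {fit.intercept:.2f}')
```

The toy values were chosen to lie on a straight line, so the fit here is exact; real data would not be.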