## Observations and Insights 

1. The summary statistics DataFrame shows that Capomulin and Ramicane have the smallest mean tumor volumes of all the drug regimens tested, at 40.67 and 40.21 mm3 respectively. Ramicane also has a slightly lower variance (23.46 vs. 24.94 for Capomulin), which indicates that its data points sit closer to the mean than Capomulin's. The other drugs in the study showed mean tumor volumes that were 28% to 37% larger.
1. The drug regimen bar charts show that Capomulin and Ramicane both have more measurements than the other regimens. A larger number of measurements can improve the precision of the estimates and strengthen the study.
1. The IQR results show no outliers for either Capomulin or Ramicane. The lower quartile, median, and upper quartile of final tumor volume are close enough between the two that neither drug clearly outperforms the other. Compared with the other regimens analyzed, however, both had markedly lower Q1 and Q3 values.
1. The box plot makes this clearer, visually depicting lower final tumor volumes for Capomulin and Ramicane. It also shows that Infubinol has an outlier below its lower bound (Q1 minus 1.5 times the IQR).
1. For the Capomulin regimen, there is a strong positive correlation between mouse weight and average tumor volume. The scatter plot with the linear regression line shows that heavier mice tended to have larger tumors, and the correlation coefficient of 0.84 confirms this.
1. Overall, the data suggest that Capomulin and Ramicane reduced tumor size more than the other regimens in the study.

In [None]:
# Dependencies and Setup
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import pandas as pd
import dataframe_image as dfi
from scipy.stats import pearsonr
from scipy.stats import linregress
import numpy as np
from IPython import display

from jupyterthemes import jtplot
jtplot.style()

# Study data files
mouse_metadata_path = 'data/Mouse_metadata.csv'
study_results_path = 'data/Study_results.csv'

# Read the mouse data and the study results
mouse_metadata = pd.read_csv(mouse_metadata_path)
study_results = pd.read_csv(study_results_path)

# Combine the data into a single dataset and display
df_mouse_merge = mouse_metadata.merge(study_results, on='Mouse ID')
df_mouse_merge.head()

In [None]:
# Checking the number of mice.
mouse_qty = len(pd.unique(df_mouse_merge['Mouse ID']))
mouse_qty

In [None]:
# Getting the duplicate mice by ID number that shows up for Mouse ID and Timepoint. 
duplicate_mouse = df_mouse_merge.loc[df_mouse_merge.duplicated(subset=[
    'Mouse ID', 'Timepoint']), 'Mouse ID'].unique()
duplicate_mouse
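
The duplicate check and cleanup can be tried on a small toy frame before running it on the study data. This is a minimal sketch: the mouse IDs and volumes below are made up for illustration and do not come from the study files.

```python
import pandas as pd

# Toy data (hypothetical values): mouse "m2" has two rows for Timepoint 0.
toy = pd.DataFrame({
    "Mouse ID": ["m1", "m1", "m2", "m2", "m2"],
    "Timepoint": [0, 5, 0, 0, 5],
    "Tumor Volume (mm3)": [45.0, 44.1, 45.0, 46.2, 47.0],
})

# keep=False marks every row of a repeated (Mouse ID, Timepoint) pair,
# not only the second and later occurrences.
dup_mask = toy.duplicated(subset=["Mouse ID", "Timepoint"], keep=False)
dup_ids = toy.loc[dup_mask, "Mouse ID"].unique()

# Drop every row that belongs to a mouse with conflicting records.
toy_clean = toy[~toy["Mouse ID"].isin(dup_ids)].reset_index(drop=True)

print(list(dup_ids))
print(toy_clean["Mouse ID"].nunique())
```

Using `keep=False` is a choice made here to make all conflicting rows visible at once; the default `keep='first'` used in the cell above finds the same mouse IDs.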

In [None]:
# Optional: Get all the data for the duplicate mouse ID. 
show_duplicates = df_mouse_merge.loc[df_mouse_merge['Mouse ID'] == 'g989']
show_duplicates.head()

In [None]:
# Create a clean DataFrame by dropping the duplicate mouse by its ID.
clean_mouse = df_mouse_merge.drop_duplicates().reset_index(drop=True)
clean_df = clean_mouse[~clean_mouse['Mouse ID'].isin(duplicate_mouse)]
clean_df.head()

In [None]:
# Checking the number of mice in the clean DataFrame.
mouse_qty = clean_df['Mouse ID'].nunique()
mouse_qty

## Summary Statistics

In [None]:
# Generate a summary statistics table of mean, median, variance, standard deviation, 
# and SEM of the tumor volume for each regimen
mean_stat = clean_df.groupby('Drug Regimen')['Tumor Volume (mm3)'].mean()
median_stat= clean_df.groupby('Drug Regimen')['Tumor Volume (mm3)'].median()
var_stat = clean_df.groupby('Drug Regimen')['Tumor Volume (mm3)'].var()
stdv_stat = clean_df.groupby('Drug Regimen')['Tumor Volume (mm3)'].std()
sem_stat = clean_df.groupby('Drug Regimen')['Tumor Volume (mm3)'].sem()

summary_df = pd.DataFrame({'Mean': mean_stat, 'Median': median_stat, 'Variance': 
                           var_stat, 'Std. Dev.': stdv_stat, 'SEM': sem_stat})
dfi.export(summary_df, '../Images/sum_stats.png')   # Export dataframe as image
summary_df

In [None]:
# Generate a summary statistics table of mean, median, variance, standard deviation, 
# and SEM of the tumor volume for each regimen using the aggregation method, 
# produce the same summary statistics in a single line
# Select the tumor volume column before aggregating so text columns are never averaged
agg_summary = clean_df.groupby('Drug Regimen')['Tumor Volume (mm3)'].agg(['mean', 'median', 'var', 'std', 'sem'])
agg_summary.head()
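
If readable column names are wanted straight from the aggregation, pandas named aggregation is one option. The sketch below uses a toy frame with two made-up regimens ("A" and "B"); the output column names are illustrative choices, not names used elsewhere in this notebook.

```python
import pandas as pd

# Toy data (hypothetical values, not from the study).
toy = pd.DataFrame({
    "Drug Regimen": ["A", "A", "A", "B", "B", "B"],
    "Tumor Volume (mm3)": [40.0, 42.0, 44.0, 50.0, 55.0, 60.0],
})

# Named aggregation: keyword = output column name, value = aggregation.
summary = toy.groupby("Drug Regimen")["Tumor Volume (mm3)"].agg(
    Mean="mean",
    Median="median",
    Variance="var",
    Std_Dev="std",
    SEM="sem",
)
print(summary)
```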

## Bar and Pie Charts

In [None]:
# Generate a bar plot showing the total number of measurements taken on each drug regimen using pandas.
regimen_df = clean_df.groupby(['Drug Regimen']).count().reset_index()
regimen_data = regimen_df[['Drug Regimen', 'Mouse ID']].rename(columns={'Mouse ID': 'Count'})
regimen_data = regimen_data.set_index('Drug Regimen')
regimen_data.plot(kind='bar', color='royalblue', figsize=(10, 5))

plt.grid(False)  # hide grid lines
plt.ylabel('Number of Measurements')

plt.gca().get_legend().remove()  # turn legend off
plt.title('Drug Regimen Measurements')
plt.show()

In [None]:
# Generate a bar plot showing the total number of measurements taken on each drug regimen using pyplot.
# Create drug regimen data set
regimen_list = summary_df.index.tolist()
x_axis = regimen_list

# Create regimen count
regimen_count = (clean_df.groupby(['Drug Regimen'])['Mouse ID'].count()).tolist()
fig = plt.figure(figsize =(10, 5))  # format figure Size

# Format title
plt.title('Drug Regimen Measurements')
plt.xlabel('Drug Regimen')
plt.ylabel('Number of Measurements')
plt.xticks(rotation=90)
plt.grid(False)   # hide grid lines

# Plot bar chart
plt.bar(x_axis, regimen_count, color='royalblue', alpha=1, width=.5, align='center')
#plt.tight_layout()
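
The measurement counts behind both bar charts can also be obtained with `value_counts`. This is a small sketch on made-up data; it assumes no missing `Mouse ID` values, in which case it matches the `groupby(...).count()` approach used above.

```python
import pandas as pd

# Toy data (hypothetical regimens and mice).
toy = pd.DataFrame({
    "Drug Regimen": ["A", "A", "B", "A", "B", "C"],
    "Mouse ID": ["m1", "m1", "m2", "m3", "m2", "m4"],
})

# One row per measurement, so counting rows counts measurements.
counts = toy["Drug Regimen"].value_counts().sort_index()

# Same result through groupby, as in the cells above.
by_group = toy.groupby("Drug Regimen")["Mouse ID"].count()

print(counts)
```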

In [None]:
# Generate a pie plot showing the distribution of female versus male mice using pandas
# Create dataframe counting unique mice by sex
gender_count = pd.DataFrame(clean_df.groupby('Sex')['Mouse ID'].nunique())

# Rename column
gender_count = gender_count.rename(columns={'Mouse ID':'Total Count'})

# Add column Percentage and calculate gender percentage
gender_count['Percentage Split'] = gender_count['Total Count'] / sum(gender_count['Total Count'])

# Plot
explode = (0.1, 0)
plot = gender_count.plot.pie(
    title='Male vs. Female Mouse Population', y='Total Count', figsize=(6, 6),
    colors=['pink', 'royalblue'], startangle=140, explode=explode,
    shadow=True, autopct="%1.1f%%")
plot.set_ylabel("")
plot.get_legend().remove()  # turn legend off
plt.tight_layout()

In [None]:
# Generate a pie plot showing the distribution of female versus male mice using pyplot
# Count unique mice per sex (groupby order is alphabetical: Female, Male)
gender_count = clean_df.groupby('Sex')['Mouse ID'].nunique().tolist()

# labels for the sections of the pie chart
labels = ['Females', 'Males']
plt.title('Male vs Female Mouse Population')

sizes = gender_count                # The values of each section of the pie chart
colors = ['pink', 'royalblue']      # Colors of each section of the pie chart
explode = (0.1, 0)                  # Offset the first wedge (Females) from the center

# Creates the pie chart based upon the values above
# Automatically finds the percentages of each part of the pie chart
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct="%1.1f%%", shadow=True, startangle=140)
plt.tight_layout()

## Quartiles, Outliers and Boxplots

In [None]:
# Calculate the final tumor volume of each mouse across four of the treatment regimens:  
# Capomulin, Ramicane, Infubinol, and Ceftamin

# Retrieve rows of specified Drug Regimen
Capomulin_df = clean_df.loc[clean_df['Drug Regimen'] == 'Capomulin']
Ramicane_df = clean_df.loc[clean_df['Drug Regimen'] == 'Ramicane']
Infubinol_df = clean_df.loc[clean_df['Drug Regimen'] == 'Infubinol']
Ceftamin_df = clean_df.loc[clean_df['Drug Regimen'] == 'Ceftamin']

# Final tumor volume at max timepoint
capo_max = Capomulin_df.groupby('Mouse ID')['Timepoint'].max()
rami_max = Ramicane_df.groupby('Mouse ID')['Timepoint'].max()
infu_max = Infubinol_df.groupby('Mouse ID')['Timepoint'].max()
ceft_max = Ceftamin_df.groupby('Mouse ID')['Timepoint'].max()

# Merge final tumor vol with clean_df dataframe to get the tumor volume at the last timepoint
capo_merge = pd.merge(capo_max, clean_df, on=['Mouse ID', 'Timepoint'], how='left')
rami_merge = pd.merge(rami_max, clean_df, on=['Mouse ID', 'Timepoint'], how='left')
infu_merge = pd.merge(infu_max, clean_df, on=['Mouse ID', 'Timepoint'], how='left')
ceft_merge = pd.merge(ceft_max, clean_df, on=['Mouse ID', 'Timepoint'], how='left')
capo_merge.head()  # used to validate dataframe
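
The "max timepoint, then merge back" pattern used above can be checked on a toy frame. The sketch below uses made-up mice and volumes, and also shows an `idxmax` variant as an alternative; neither is claimed to be the only way to do it.

```python
import pandas as pd

# Toy data (hypothetical values): m1 is measured three times, m2 twice.
toy = pd.DataFrame({
    "Mouse ID": ["m1", "m1", "m1", "m2", "m2"],
    "Drug Regimen": ["A", "A", "A", "B", "B"],
    "Timepoint": [0, 5, 10, 0, 5],
    "Tumor Volume (mm3)": [45.0, 43.5, 41.0, 45.0, 48.2],
})

# Step 1: last recorded timepoint for each mouse.
last_tp = toy.groupby("Mouse ID")["Timepoint"].max().reset_index()

# Step 2: merge back to recover the full row at that timepoint.
final_vol = last_tp.merge(toy, on=["Mouse ID", "Timepoint"], how="left")

# Alternative: pick the row label of the max timepoint directly.
alt = toy.loc[toy.groupby("Mouse ID")["Timepoint"].idxmax()]

print(final_vol[["Mouse ID", "Timepoint", "Tumor Volume (mm3)"]])
```

The merge version keeps working if the original index is not unique, while the `idxmax` version relies on unique row labels.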

In [None]:
# Put treatments into a list for the for loop (and later for plot labels)
list_key_drugs = ['Capomulin', 'Ramicane', 'Infubinol', 'Ceftamin']

# Create empty list to fill with tumor vol data (for plotting)
empty_tumor_list = []
                  
# Calculate the IQR and quantitatively determine if there are any potential outliers. 
    # Locate the rows which contain mice on each drug and get the tumor volumes
    # add subset 
    # Determine outliers using upper and lower bounds
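
The loop outlined in the comments above could look roughly like the following. This is a sketch on made-up final volumes for two hypothetical regimens ("A" and "B"); the helper name `iqr_bounds` and the dictionary `outliers_by_drug` are illustrative and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical final tumor volumes (mm3) for two made-up regimens.
toy = pd.DataFrame({
    "Drug Regimen": ["A"] * 5 + ["B"] * 6,
    "Tumor Volume (mm3)": [30.0, 32.0, 34.0, 36.0, 38.0,
                           20.0, 58.0, 60.0, 62.0, 64.0, 66.0],
})


def iqr_bounds(values):
    """Return (lower, upper) bounds placed 1.5 * IQR beyond Q1 and Q3."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


tumor_vol_by_drug = []   # one Series per regimen, reusable for a box plot
outliers_by_drug = {}    # regimen -> list of values outside the bounds

for drug in ["A", "B"]:
    # Locate the rows for this regimen and take the tumor volumes.
    vols = toy.loc[toy["Drug Regimen"] == drug, "Tumor Volume (mm3)"]
    tumor_vol_by_drug.append(vols)

    # Flag values outside the lower and upper bounds.
    low, high = iqr_bounds(vols)
    outliers_by_drug[drug] = vols[(vols < low) | (vols > high)].tolist()

print(outliers_by_drug)
```

Unlike printing only the bounds, this lists the actual values that fall outside them, which makes the "no outliers" or "one outlier" statement for each regimen directly checkable.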

In [None]:
# Capomulin. Calculate the IQR and quantitatively determine if there are any potential outliers. 
capo_quar = capo_merge['Tumor Volume (mm3)']
quartiles = capo_quar.quantile([.25,.5,.75])
lowerquart = quartiles[.25]
upperquart = quartiles[.75]
iqr = upperquart-lowerquart

print(f'The lower quartile of the tumor volume is: {lowerquart}')
print(f'The upper quartile of the tumor volume is: {upperquart}')
print(f'The interquartile range of the tumor volume is: {iqr}')
print(f'The median of the tumor volume is: {quartiles[.5]}')

# Determine outliers by using upper and lower bounds
high = upperquart + (1.5*iqr)
low = lowerquart - (1.5*iqr)    

print(f'Values below {low} could be outliers.')
print(f'Values above {high} could be outliers.')

In [None]:
# Ceftamin. Calculate the IQR and quantitatively determine if there are any potential outliers. 
ceft_quar = ceft_merge['Tumor Volume (mm3)']
quartiles = ceft_quar.quantile([.25,.5,.75])
lowerquart = quartiles[.25]
upperquart = quartiles[.75]
iqr = upperquart-lowerquart

print(f'The lower quartile of the tumor volume is: {lowerquart}')
print(f'The upper quartile of the tumor volume is: {upperquart}')
print(f'The interquartile range of the tumor volume is: {iqr}')
print(f'The median of the tumor volume is: {quartiles[.5]}')

# Determine outliers by using upper and lower bounds
high = upperquart + (1.5*iqr)
low = lowerquart - (1.5*iqr)    

print(f'Values below {low} could be outliers.')
print(f'Values above {high} could be outliers.')

In [None]:
# Ramicane. Calculate the IQR and quantitatively determine if there are any potential outliers. 
rami_quar = rami_merge['Tumor Volume (mm3)']
quartiles = rami_quar.quantile([.25,.5,.75])
lowerquart = quartiles[.25]
upperquart = quartiles[.75]
iqr = upperquart-lowerquart

print(f'The lower quartile of the tumor volume is: {lowerquart}')
print(f'The upper quartile of the tumor volume is: {upperquart}')
print(f'The interquartile range of the tumor volume is: {iqr}')
print(f'The median of the tumor volume is: {quartiles[.5]}')

# Determine outliers by using upper and lower bounds
high = upperquart + (1.5*iqr)
low = lowerquart - (1.5*iqr)    

print(f'Values below {low} could be outliers.')
print(f'Values above {high} could be outliers.')

In [None]:
# Infubinol. Calculate the IQR and quantitatively determine if there are any potential outliers. 
infu_quar = infu_merge['Tumor Volume (mm3)']
quartiles = infu_quar.quantile([.25,.5,.75])
lowerquart = quartiles[.25]
upperquart = quartiles[.75]
iqr = upperquart-lowerquart

print(f'The lower quartile of the tumor volume is: {lowerquart}')
print(f'The upper quartile of the tumor volume is: {upperquart}')
print(f'The interquartile range of the tumor volume is: {iqr}')
print(f'The median of the tumor volume is: {quartiles[.5]}')

# Determine outliers by using upper and lower bounds
high = upperquart + (1.5*iqr)
low = lowerquart - (1.5*iqr)    

print(f'Values below {low} could be outliers.')
print(f'Values above {high} could be outliers.')

In [None]:
# Generate a box plot of the final tumor volume of each mouse across four regimens of interest
plot_regimen_data =[capo_quar, rami_quar, infu_quar, ceft_quar]
regimen_labels = ['Capomulin', 'Ramicane', 'Infubinol', 'Ceftamin']

fig1, ax = plt.subplots(figsize=(10, 5))  # Plot size
f = 'lightgray'  # Property dictionaries (colors)
c = 'blue'
b = 'lightblue'
w = 'red'
m = 'blue'
ax.set_title('Tumor Volume of Selected Drug Regimens',fontsize =26)
ax.set_ylabel('Final Tumor Volume (mm3)',fontsize = 12)
ax.set_xlabel('Drug Regimen',fontsize = 12)
ax.boxplot(plot_regimen_data, labels=regimen_labels, widths = 0.4, patch_artist=True,vert=True, 
           boxprops=dict(facecolor=f, color=c),
           capprops=dict(color=c),
           whiskerprops=dict(color=c),
           flierprops=dict(color=w, markeredgecolor=w),
           medianprops=dict(color=m),
           )

plt.ylim(10, 80)

plt.savefig("../Images/box_plot.png", bbox_inches = "tight")

plt.show()

## Line and Scatter Plots

In [None]:
# Select mouse treated with Capomulin and display data
line_df = Capomulin_df.loc[Capomulin_df['Mouse ID'] == 'j246',:]
line_df.head()

In [None]:
# Generate a line plot of tumor volume vs. time point for mouse j246
x_axis = line_df['Timepoint']
y_axis = line_df['Tumor Volume (mm3)']

fig1, ax1 = plt.subplots(figsize=(10, 5))   # Plot size
plt.plot(x_axis, y_axis,linewidth=1, markersize=10,marker="o",color="blue")
plt.title('Mouse j246 Treated with Capomulin')
plt.xlabel('Time (Days)',fontsize = 12)
plt.ylabel('Tumor Volume (mm3)',fontsize = 12)
plt.show()

## Correlation and Regression

In [None]:
# Create scatter plot of tumor volume vs mouse weight for Capomulin treatment 
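# Average only the columns needed; a bare .mean() fails on text columns in pandas 2.x
avg_capo_data = Capomulin_df.groupby('Mouse ID')[['Weight (g)', 'Tumor Volume (mm3)']].mean()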
x_val = avg_capo_data['Weight (g)']
y_val = avg_capo_data['Tumor Volume (mm3)']

fig1, ax1 = plt.subplots(figsize=(10, 5))   # Plot size
plt.scatter(x_val,y_val, s=120, color='orange')
plt.title('Capomulin: Mouse Weight vs. Average Tumor Volume',fontsize =24)
plt.xlabel('Weight (g)',fontsize =12)
plt.ylabel('Average Tumor Volume (mm3)',fontsize =12)
plt.show()

In [None]:
# Calculate the correlation coefficient and linear regression model 
# for mouse weight and average tumor volume for the Capomulin regimen
corr, _ = pearsonr(x_val, y_val)
print(f'The correlation between mouse weight and average tumor volume is: {corr:.3f}')

# Use linregress to find the values used to plot the linear regression model
(slope, intercept, rvalue, pvalue, stderr) = linregress(x_val, y_val)

regression_val = x_val*slope+intercept
print(f'Slope: {slope}')
print(f'Intercept: {intercept}')
print(f'Rvalue (Correlation Coefficient): {rvalue}')
print(f'Scipy pearsonr (Correlation Coefficient): {corr:.2f}')
print(f'Pvalue: {pvalue}')
print(f'Stderr: {stderr}')
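
For a simple one-variable regression, the `rvalue` reported by `linregress` and the coefficient from `pearsonr` are the same quantity, which is why the two numbers printed above agree. A minimal sketch on made-up weights and volumes (not study data):

```python
import numpy as np
from scipy.stats import linregress, pearsonr

# Hypothetical per-mouse averages, for illustration only.
weight = np.array([15.0, 17.0, 19.0, 21.0, 23.0, 25.0])
avg_volume = np.array([36.0, 37.5, 40.0, 40.5, 43.0, 45.5])

r, p_value = pearsonr(weight, avg_volume)
fit = linregress(weight, avg_volume)

print(f"Pearson r:        {r:.3f}")
print(f"linregress r:     {fit.rvalue:.3f}")
print(f"slope, intercept: {fit.slope:.3f}, {fit.intercept:.3f}")
print(f"r squared:        {fit.rvalue ** 2:.3f}")
```

The squared value is the share of variance in average tumor volume that the fitted line accounts for, which is often easier to interpret than r alone.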

In [None]:
# Using linear regression values, plot the linear regression model on top of the previous scatter plot. 
fig1, ax1 = plt.subplots(figsize=(10, 5))   # Plot size
plt.scatter(x_val,y_val, s=120, color='orange')
plt.title('Capomulin: Mouse Weight vs. Average Tumor Volume',fontsize =24)
plt.xlabel('Weight (g)',fontsize =12)
plt.ylabel('Average Tumor Volume (mm3)',fontsize =12)
plt.plot(x_val,regression_val,'b-')
plt.savefig('../Images/linear_regression.png', bbox_inches = 'tight')
plt.show()

In [None]:
# Challenging myself to use an alternate method to 
# plot the linear regression model on top of the previous scatter plot. 
fig1, ax1 = plt.subplots(figsize=(10, 5))   # Plot size
plt.scatter(x_val,y_val, s=120, color='orange')
plt.title('Capomulin: Mouse Weight vs. Average Tumor Volume',fontsize =24)
plt.xlabel('Weight (g)',fontsize =12)
plt.ylabel('Average Tumor Volume (mm3)',fontsize =12)

# Get slope and intercept of linear regression line (LRL)
m, b = np.polyfit(x_val,y_val, 1)

# add LRL to scatter plot
plt.plot(x_val, m*x_val+b)
plt.show()
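
A degree-1 `np.polyfit` is an ordinary least-squares line fit, so it should reproduce the slope and intercept from the closed-form formulas. The sketch below checks this on made-up data and evaluates the line on an evenly spaced grid, which is one way to get a clean line for plotting when the x values are unsorted.

```python
import numpy as np

# Hypothetical per-mouse averages, for illustration only.
weight = np.array([15.0, 17.0, 19.0, 21.0, 23.0, 25.0])
avg_volume = np.array([36.0, 37.5, 40.0, 40.5, 43.0, 45.5])

# Degree-1 polynomial fit: coefficients come back highest power first.
slope, intercept = np.polyfit(weight, avg_volume, 1)
line = np.poly1d([slope, intercept])

# Closed-form least squares for comparison: slope = cov(x, y) / var(x).
cov_xy = np.cov(weight, avg_volume, ddof=1)[0, 1]
slope_check = cov_xy / np.var(weight, ddof=1)
intercept_check = avg_volume.mean() - slope_check * weight.mean()

# Evaluate the fitted line on an evenly spaced grid for plotting.
x_line = np.linspace(weight.min(), weight.max(), 50)
y_line = line(x_line)

print(np.isclose(slope, slope_check), np.isclose(intercept, intercept_check))
```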