In this project, you will analyze the PlantGrowth R dataset. You will find a short description of it on Vincent Arel-Bundock's Rdatasets page. The dataset contains two main variables, a treatment group and the weight of plants within those groups.

Your task is to perform t-tests and ANOVA on this dataset while describing the dataset and explaining your work. In doing this you should:

Download and save the dataset to your repository.

Describe the data set in your notebook.

Describe what a t-test is, how it works, and what the assumptions are.

Perform a t-test to determine whether there is a significant difference between the two treatment groups trt1 and trt2.

Perform ANOVA to determine whether there is a significant difference between the three treatment groups ctrl, trt1, and trt2.

Explain why it is more appropriate to apply ANOVA rather than several t-tests when analyzing more than two groups.

#### Importing Required Libraries

In [None]:
## Mathematical functions from the standard library
# https://docs.python.org/3/library/math.html  
import math 

## Numerical structures and operations
# https://numpy.org/doc/stable/reference/index.html  
import numpy as np 

## Plotting
# https://matplotlib.org/stable/contents.html  
import matplotlib.pyplot as plt

## Random selections
# https://docs.python.org/3/library/random.html  
import random

## Permutations and combinations
# https://docs.python.org/3/library/itertools.html  
import itertools

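## Data frames
# https://pandas.pydata.org/docs/  
import pandas as pd

## Statistics
# https://docs.scipy.org/doc/scipy/reference/stats.html  
import scipy.stats as stats
import statistics

## Statistical plots
# https://seaborn.pydata.org/  
import seaborn as sns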

## 1 Downloading and saving the required dataset

The dataset used for this project is the [PlantGrowth R dataset](https://vincentarelbundock.github.io/Rdatasets/index.html). 
The PlantGrowth dataset contains the results of an experiment on plant growth under different conditions. It comprises two main variables: a treatment group (ctrl, trt1 and trt2) and the weight of the plants within those groups.  
The dataset was originally sourced from Dobson's An Introduction to Statistical Modelling (1983).  
The dataset is contained in the PlantGrowth.csv file in the [AppliedStatistics repository](https://github.com/rebeccaf1918/AppliedStatistics/blob/main/PlantGrowth.csv) by Rebecca Feeley.
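
A minimal sketch of how the download-and-save step could be scripted with pandas. The CSV URL follows the usual layout of the Rdatasets site and is an assumption, as is the helper name `save_dataset`; neither is taken from this notebook.

```python
import pandas as pd

# Assumed location of the raw CSV on the Rdatasets site (not taken from this notebook).
PLANTGROWTH_URL = "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/PlantGrowth.csv"


def save_dataset(source, destination="PlantGrowth.csv"):
    """Read a CSV from a URL or local path and save a copy to `destination`."""
    df = pd.read_csv(source)
    # index=False avoids writing the pandas index as an extra column.
    df.to_csv(destination, index=False)
    return df
```

Calling `save_dataset(PLANTGROWTH_URL)` from the repository folder should then write `PlantGrowth.csv` alongside the notebook, provided the URL is reachable.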

## 2 Description of the Dataset

In [None]:
# loading in dataset and printing out the first 5 rows
data = pd.read_csv('PlantGrowth.csv')
print(data.head())

In [None]:
# Now I will conduct basic summary analysis of the dataset
print(data.shape)
print(data['group'].value_counts()) # determining how many values are in each group

# Inspecting the dataset for missing values - an output of False for a column means it has no missing values
data.isnull().any()

The PlantGrowth dataset, which I have loaded into a pandas DataFrame, contains 30 observations of 2 variables: the dried weight of each plant and the group it belonged to in the experiment. Each of the three groups (one control and two treatment groups) contains 10 observations.  
The experiment compared the yields of plants grown under two treatment conditions against a control. The weight is a numerical variable (float64) and the group is a categorical variable (object).  
For clarity, I have chosen to remove the rownames column, as the DataFrame already has an index and this column holds no additional information.


In [None]:
# removing the redundant rownames column in place
data.drop('rownames', axis='columns', inplace=True)


# checking that the deletion operation was done correctly
data.head()


Next, I use the head() and tail() functions to display the first 5 and last 5 rows of the dataset. This shows the column names, the number of columns, and a general overview of the top and bottom of the dataset.

In [None]:
data.head()

In [None]:
data.tail()

In [None]:
data.info() # this is a useful function in pandas which summarises the above info in one function output

In [None]:

data.describe() # using the describe function to determine various characteristics incl the mean, std deviation and more



In [None]:
data.groupby('group').describe()  #generating the summary statistics by group to see if any trends occur due to group type

### Visual Analysis

To better understand the dataset, I will now analyse it visually. I will first create a histogram for each group to show the distribution of the plant weights, and then a boxplot and a violin plot, which make it easier to compare the weights across the groups.


### Histogram
Histograms are useful for visualising the distribution of a single variable. In this case, we can easily see the distribution of the plant weights in each of the groups (control, treatment 1 and treatment 2).

In [None]:
data.hist(column='weight', by='group', bins=10, color='green', edgecolor='black', figsize=(10, 8))
plt.suptitle('Histograms of Plant Weights by Group')
plt.subplots_adjust(top=0.9, hspace=0.4)
plt.show()

### Boxplot
Next, I will create a boxplot of the dataset. A boxplot lets us visualise the differences in plant weights across the groups and makes the central tendency and spread of each group easy to distinguish.

In [None]:
sns.boxplot(x='group', y='weight', data=data)
plt.title('Box Plot of Plant Weights by Group')
plt.show()

In [None]:
sns.violinplot(x='group', y='weight', data=data)
plt.title('Violin Plot of Plant Weights by Group')
plt.show()


## Describe what a t-test is, how it works, and what the assumptions are.

A t-test is a statistical test used to determine whether there is a significant difference between two samples. It does this by comparing the means of the two groups and assessing whether the difference between them is statistically significant or could plausibly be due to chance.
The t-test was originally developed by [William Sealy Gosset](https://www.scientificamerican.com/article/how-the-guinness-brewery-invented-the-most-important-statistical-method-in/), who worked for the Guinness Brewing Company and developed the test as part of his work on measuring the quality of the stout produced. He published it in 1908 under the pseudonym 'Student' so as not to tip off competitors to his findings.
The t-test has since become one of the cornerstones of modern statistical analysis and is often used in hypothesis testing to determine whether a particular treatment or process has an actual effect on a group, e.g. a control vs a treatment group. It is also often used to determine whether two groups differ from one another.  
The t-test is most commonly used when the data is approximately normally distributed but the population variance is unknown.
The t-test takes the means of the two groups and calculates a t-statistic, which measures the difference between the means relative to the variability in the data. The t-statistic is used to assess the evidence against the null hypothesis. (The null hypothesis is that there is no difference between the population means of the two groups.)
The t-statistic is then converted to a p-value, and the difference is judged statistically significant if the p-value falls below a chosen threshold (often 0.05). A p-value above the threshold means we fail to reject the null hypothesis; it does not prove the null hypothesis is true.
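
To make the calculation concrete, here is a minimal sketch that computes the pooled-variance t-statistic and its two-sided p-value by hand and cross-checks them against `scipy.stats.ttest_ind`. The two arrays are made-up illustrative measurements, not values from the PlantGrowth dataset, and the variable names are my own.

```python
import numpy as np
from scipy import stats

# Illustrative, made-up measurements for two independent groups (not PlantGrowth values).
group_a = np.array([5.1, 4.9, 5.6, 5.8, 6.0, 5.3])
group_b = np.array([4.4, 4.8, 4.1, 5.0, 4.6, 4.3])

n_a, n_b = len(group_a), len(group_b)
mean_a, mean_b = group_a.mean(), group_b.mean()

# Sample variances (ddof=1 divides by n - 1).
var_a, var_b = group_a.var(ddof=1), group_b.var(ddof=1)

# Pooled variance, assuming the two groups share a common variance.
pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)

# t-statistic: difference in means divided by its standard error.
t_manual = (mean_a - mean_b) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-sided p-value from the t distribution with n_a + n_b - 2 degrees of freedom.
dof = n_a + n_b - 2
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

# Cross-check against SciPy's implementation of the same test.
t_scipy, p_scipy = stats.ttest_ind(group_a, group_b, equal_var=True)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy : t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```

The two printed lines should agree, which shows that the library call is doing nothing more than the arithmetic described above.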

There are 3 main types of t-test:

- One-sample t-test: compares the mean of a single group against a known value or population mean.
- Two-sample t-test (independent t-test): compares the means of two independent, separate groups.
- Paired-sample t-test: compares the means of the same group measured at different times (e.g., before and after a treatment).
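
As a rough illustration, each of the three variants corresponds to a function in `scipy.stats`. The sketch below runs all three on synthetic, seeded random data; the arrays and the reference value of 5.0 are assumptions chosen purely for demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)  # seeded so the synthetic data is reproducible

# Synthetic samples for illustration only.
before = rng.normal(loc=5.0, scale=0.5, size=12)
after = before + rng.normal(loc=0.3, scale=0.2, size=12)  # same units measured again
other_group = rng.normal(loc=5.4, scale=0.5, size=12)     # separate, unrelated units

# One-sample: is the mean of `before` different from a reference value of 5.0?
one_sample = stats.ttest_1samp(before, popmean=5.0)

# Two-sample (independent): do `before` and `other_group` have different means?
two_sample = stats.ttest_ind(before, other_group)

# Paired: did the same units change between `before` and `after`?
paired = stats.ttest_rel(before, after)

for name, result in [("one-sample", one_sample), ("independent", two_sample), ("paired", paired)]:
    print(f"{name:12s} t = {result.statistic:.3f}, p = {result.pvalue:.3f}")
```

A paired test is equivalent to a one-sample test on the per-unit differences, which is why it only makes sense when both measurements come from the same units.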

For the analysis of the PlantGrowth dataset, I will conduct a two-sample (independent) t-test, as the two treatment groups are independent of each other: each group is made up of different plants, rather than one set of plants measured twice.

### Assumptions required to carry out a t test
For the results of a t-test to be valid, several assumptions must be met.  
These assumptions differ depending on the type of t-test used. As I will be using a two-sample (independent) t-test, I will focus on the assumptions required for this test:

- The two samples must be independent of each other.
- The dependent variable should be continuous.
- The independent variable should be categorical, with two levels that define the two groups.
- The data in each group should be approximately normally distributed.
- Each sample must be randomly drawn from its respective population.
- The variances of the two groups should be equal (homogeneity of variances).

If these assumptions are not met for the data used in a t-test, the result may not be reliable, and an alternative method of analysis should be considered.
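
The normality and equal-variance assumptions can be checked numerically. The sketch below is one possible approach, using the Shapiro-Wilk and Levene tests from `scipy.stats` on synthetic stand-in samples; the choice of these two tests and the sample values are my own assumptions and are not prescribed by the task.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

# Synthetic stand-ins for two treatment groups (illustrative values only).
sample_1 = rng.normal(loc=4.7, scale=0.8, size=10)
sample_2 = rng.normal(loc=5.5, scale=0.4, size=10)

# Shapiro-Wilk: the null hypothesis is that the sample comes from a normal distribution.
for label, sample in [("sample_1", sample_1), ("sample_2", sample_2)]:
    w_stat, p_value = stats.shapiro(sample)
    print(f"Shapiro-Wilk {label}: W = {w_stat:.3f}, p = {p_value:.3f}")

# Levene: the null hypothesis is that the groups have equal variances.
levene_stat, levene_p = stats.levene(sample_1, sample_2)
print(f"Levene: statistic = {levene_stat:.3f}, p = {levene_p:.3f}")

# If equal variances look doubtful, Welch's t-test drops that assumption.
welch = stats.ttest_ind(sample_1, sample_2, equal_var=False)
print(f"Welch t-test: t = {welch.statistic:.3f}, p = {welch.pvalue:.3f}")
```

In this notebook, the same calls could be applied to the real data by selecting each group's weights, for example `data.loc[data['group'] == 'trt1', 'weight']`. For both checks, a small p-value suggests the assumption may be violated.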


## Performing a t test