In this project, you will analyze the PlantGrowth R dataset. You will find a short description of it on Vincent Arel-Bundock's Rdatasets page. The dataset contains two main variables: a treatment group and the weight of the plants within those groups.

Your task is to perform t-tests and ANOVA on this dataset while describing the dataset and explaining your work. In doing this you should:

1. Download and save the dataset to your repository.

2. Describe the dataset in your notebook.

3. Describe what a t-test is, how it works, and what the assumptions are.

4. Perform a t-test to determine whether there is a significant difference between the two treatment groups trt1 and trt2.

5. Perform ANOVA to determine whether there is a significant difference between the three treatment groups ctrl, trt1, and trt2.

6. Explain why it is more appropriate to apply ANOVA rather than several t-tests when analyzing more than two groups.

#### Importing Required Libraries

In [5]:
## Mathematical functions from the standard library
# https://docs.python.org/3/library/math.html  
import math 

## Numerical structures and operations
# https://numpy.org/doc/stable/reference/index.html  
import numpy as np 

## Plotting
# https://matplotlib.org/stable/contents.html  
import matplotlib.pyplot as plt

## Random selections
# https://docs.python.org/3/library/random.html  
import random

## Permutations and combinations
# https://docs.python.org/3/library/itertools.html  
import itertools

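## Data frames
# https://pandas.pydata.org/docs/
import pandas as pd

## Statistics and statistical plotting
# https://docs.scipy.org/doc/scipy/reference/stats.html
import scipy.stats as stats
# https://seaborn.pydata.org/
import seaborn as sns
# https://docs.python.org/3/library/statistics.html
import statistics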

## 1 Downloading and saving the required dataset

The dataset used for this project is the [PlantGrowth R dataset](https://vincentarelbundock.github.io/Rdatasets/index.html). 
The PlantGrowth dataset records plant growth under different experimental conditions. It comprises two main variables: a treatment group (ctrl, trt1 and trt2) and the weight of the plants within those groups.  
The dataset originally comes from Dobson's *An Introduction to Statistical Modelling* (1983).  
The dataset is contained in the PlantGrowth.csv file in the [AppliedStatistics repository](https://github.com/rebeccaf1918/AppliedStatistics/blob/main/PlantGrowth.csv) by Rebecca Feeley.
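
The download step itself is not shown in this notebook. A minimal sketch of one way to do it follows, assuming the CSV is published at the Rdatasets URL given in the code. The URL, the function name `download_dataset` and the default filename are assumptions for illustration and should be checked against the dataset's page.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumed location of the CSV on the Rdatasets site; verify it before relying on it.
PLANTGROWTH_URL = (
    "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/PlantGrowth.csv"
)


def download_dataset(url=PLANTGROWTH_URL, destination="PlantGrowth.csv"):
    """Download the CSV at `url`, save it to `destination` and return the path."""
    path = Path(destination)
    # Skip the download if the file has already been saved to the repository.
    if not path.exists():
        urlretrieve(url, str(path))
    return path
```

Calling `download_dataset()` once from the repository root would save the file under the name that `pd.read_csv('PlantGrowth.csv')` expects.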

## 2 Description of the Dataset

In [6]:
# loading in dataset and conducting basic analysis of the dataset
data = pd.read_csv('PlantGrowth.csv')
print(data.shape)
print(data.head())
print(data['group'].value_counts())

(30, 3)
   rownames  weight group
0         1    4.17  ctrl
1         2    5.58  ctrl
2         3    5.18  ctrl
3         4    6.11  ctrl
4         5    4.50  ctrl
group
ctrl    10
trt1    10
trt2    10
Name: count, dtype: int64


The PlantGrowth dataset, loaded here into a pandas DataFrame, contains 30 rows. Apart from the `rownames` column, it has two columns: `weight`, the dried weight of each plant, and `group`, the experimental group the plant belonged to. Each of the three groups (one control and two treatment groups) contains 10 observations.   
The experiment compared plant yields under the two treatment conditions against a control. `weight` is a numerical variable (float64) and `group` is a categorical variable (stored as object).   
For clarity, I have removed the `rownames` column: the DataFrame already has an index, and this column holds no information beyond it.


In [7]:
# drop the redundant 'rownames' column in place
data.drop('rownames', axis='columns', inplace=True)

# check that the column was removed
print(data.head())

   weight group
0    4.17  ctrl
1    5.58  ctrl
2    5.18  ctrl
3    6.11  ctrl
4    4.50  ctrl


In [8]:
print(data.head())  # review the DataFrame structure
data.describe()  # summary statistics: count, mean, standard deviation, min, quartiles and max

   weight group
0    4.17  ctrl
1    5.58  ctrl
2    5.18  ctrl
3    6.11  ctrl
4    4.50  ctrl


          weight
count  30.000000
mean    5.073000
std     0.701192
min     3.590000
25%     4.550000
50%     5.155000
75%     5.530000
max     6.310000
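
The summary above pools all three groups. Since the analysis compares groups, a per-group breakdown may also be useful. A minimal sketch follows, assuming a DataFrame with the `weight` and `group` columns described earlier; the helper name `summarise_by_group` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def summarise_by_group(df, value_col="weight", group_col="group"):
    """Return count, mean, standard deviation, min and max of `value_col` per group."""
    return df.groupby(group_col)[value_col].agg(["count", "mean", "std", "min", "max"])
```

With the DataFrame loaded earlier, `summarise_by_group(data)` would return one row each for ctrl, trt1 and trt2.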
