<a href="https://colab.research.google.com/github/worldwidekatie/basic_stats/blob/master/Basic_Stats_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![alt text](https://docs.google.com/drawings/d/e/2PACX-1vS4FTC09V3OsgPrg_pBRGPEO9q4FohcGPImvN8GNcVUVy0z4SBFtUigQURgAzv8ztZzHI96UOMsWWKl/pub?w=75&h=74)

[AmeliorMate.com](https://www.AmeliorMate.com)

# **Basic Statistics**

The purpose of this notebook is to provide academics without access to their normal statistical software a low-code option for running their most basic and popular statistical tests. The notebook includes instructions for running:
- T-Tests
- Linear Regression
- Chi^2

## **How to Get Started with this Notebook:**

### Start by making your own copy of the notebook by going to File -> Save a copy in drive...

## **How to Upload Your csv file**

### 1. Run CELL #1 by pushing the play button that appears when you hover over the [ ] in the upper left-hand corner.

### 2. Click the "Choose Files" or "Browse" button (which appears after you've hit the "play" button) and choose the csv you want to upload from your computer.

In [0]:
# CELL 1
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import plotly.express as px
from google.colab import files
files.upload()

### 3. Copy and paste the name of your csv between the quotation marks inside the parentheses in CELL #2 (replacing `YOUR_CSV_NAME_HERE.csv`), then run CELL #2.

**NOTE** It is important to use a csv that already has headers. The following tests will not work if your csv does not already have headers. If your file has loaded correctly, you will see the headers and the first few rows below CELL #2 once you've run it.

In [0]:
# CELL 2
df = pd.read_csv('YOUR_CSV_NAME_HERE.csv')
df.head()
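
If your csv does not have a header row, one possible workaround is to supply the headers when reading the file. The sketch below is only an illustration: the csv contents and the column names `group` and `score` are made up, and `io.StringIO` stands in for an uploaded file.

```python
import io
import pandas as pd

# Hypothetical csv contents with no header row (made-up data).
raw_csv = "control,4.1\ncontrol,3.9\ntreatment,5.2\n"

# header=None tells pandas the first line is data, not headers;
# names= supplies the headers to use instead (assumed names).
example_df = pd.read_csv(io.StringIO(raw_csv), header=None, names=["group", "score"])
print(example_df.head())
```

With a real upload, you would pass the file name instead of `io.StringIO(raw_csv)`.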

# How to run t-tests:
**NOTE: Make sure every column except the grouping column (the column that says which group each row belongs to) contains only numeric values, or the test will not run.**
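
One way to check for non-numeric columns before running the test is sketched below. The dataframe and the column names (`group`, `score`, `notes`) are made up for illustration; with your own data you would use `df` and your own grouping column.

```python
import pandas as pd

# Made-up data: 'group' is the grouping column, 'notes' is accidentally text.
example_df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "score": [1.0, 2.0, 3.0, 4.0],
    "notes": ["ok", "ok", "late", "ok"],
})

group_column = "group"  # hypothetical name of the grouping column

# List every column, other than the grouping column, that is not numeric.
non_numeric = (
    example_df.drop(columns=[group_column])
    .select_dtypes(exclude="number")
    .columns.tolist()
)
print(non_numeric)

# Dropping those columns leaves only the grouping column and numeric columns.
cleaned_df = example_df.drop(columns=non_numeric)
```
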
### 1. Run CELL #3, then go to step 2.
(It will seem like nothing happened when you run CELL #3, but don't worry: important things happened behind the scenes. If a number appears where there was previously a blank [   ], you know it ran correctly.)


In [0]:
# CELL 3
def t_test(dataframe, column, group_a, group_b):
  """Run an independent-samples t-test on every other column, comparing two groups."""
  # Keep only the rows for each group, then drop the grouping column itself.
  groupa = dataframe[dataframe[column] == group_a].drop(columns=[column])
  groupb = dataframe[dataframe[column] == group_b].drop(columns=[column])
  output = []

  # Compare the two groups on each remaining (numeric) column.
  for theme in groupa.columns.tolist():
    t_stat, p_value = stats.ttest_ind(groupa[theme], groupb[theme], nan_policy='omit')
    output.append([theme, groupa[theme].mean(), groupb[theme].mean(), t_stat, p_value])

  output2 = pd.DataFrame(output,
                  columns=['Variable', 'Group A Mean', 'Group B Mean', 'T-Statistic', 'P-Value'])

  # Smallest p-values first.
  return output2.sort_values(by=['P-Value'])

## 2. Fill in the parentheses below:
a. **Do not alter t_test(df,** or it will break everything. Only replace the three items in quotations, and be sure to keep the quotation marks.

b. Fill in `'COLUMN_NAME_HERE'` with the header for the column that lists which group each row belongs to.

c. Fill in `'GROUP_A_NAME_HERE'` with the name of the first group whose means you'd like to test.

d. Fill in `'GROUP_B_NAME_HERE'` with the name of the second group whose means you'd like to test.

**It is better to copy and paste** because if any of these three items are not **exactly** as they appear in the csv you uploaded, it will not run.

## 3. Run CELL #4

In [0]:
# CELL 4
t_test(df, 'COLUMN_NAME_HERE', 'GROUP_A_NAME_HERE', 'GROUP_B_NAME_HERE')
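
To see what the function computes for a single column, here is a minimal sketch on made-up data. The column names `condition` and `score` and the group names are assumptions for illustration only.

```python
import pandas as pd
from scipy import stats

# Made-up data: four scores per group.
example_df = pd.DataFrame({
    "condition": ["control"] * 4 + ["treatment"] * 4,
    "score": [1.0, 2.0, 3.0, 4.0, 3.0, 4.0, 5.0, 6.0],
})

# Split the scores by group, as t_test does for each column.
control = example_df.loc[example_df["condition"] == "control", "score"]
treatment = example_df.loc[example_df["condition"] == "treatment", "score"]

# Independent-samples t-test; rows with missing values are ignored.
t_stat, p_value = stats.ttest_ind(control, treatment, nan_policy="omit")
print(control.mean(), treatment.mean(), t_stat, p_value)
```

By default `stats.ttest_ind` assumes the two groups have equal variances. Passing `equal_var=False` runs Welch's t-test instead, which does not make that assumption.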

# How to Run Linear Regression
### 1. Run CELL #5. 
(It will seem like nothing happened when you run CELL #5, but don't worry: important things happened behind the scenes. If a number appears where there was previously a blank [   ], you know it ran correctly.)


In [0]:
# CELL 5

def lin_reg(dataframe, features, target):
  model = LinearRegression()
  x_train = dataframe[features]
  y_train = dataframe[target]
  model.fit(x_train, y_train)
  y_pred = model.predict(x_train)
  mse = mean_squared_error(y_train, y_pred)

  print("Features Used:")
  print(features)
  print('\n')
  print("Model R^2:")
  print(r2_score(y_train, y_pred))
  print('\n')
  print('Model Intercept:')
  print(model.intercept_)
  print('\n')
  print("Model Coefficients:")
  print(model.coef_)
  print('\n')
  print("Mean Squared Error:")
  print(mse)
  print('\n')
  print("Root Mean Squared Error:")
  print(np.sqrt(mse))
  print('\n')
  print("Mean Absolute Error")
  print(mean_absolute_error(y_train, y_pred))
  print('\n')

## 2. Fill in the parentheses below:
a. **Do not alter `lin_reg(df, independent_variables, dependent_variable)`** or it will break everything. Only replace the items in quotation marks, and be sure to keep the quotation marks.

b. Fill in `'DEPENDENT_VARIABLE_HERE'` with the header for the column that contains values for your dependent variable.

c. Fill in `'INDEPEDENT_VARIABLE_1_HERE'` with the header name for the column containing the first independent variable you would like to add to your linear regression model. If you only have one independent variable, delete `'INDEPENDENT_VARIABLE_2_HERE'`

If you only have one independent variable, it should look like this:

```
# independent_variables = ['HEADER_NAME']
```


d. If you have two independent variables, fill in `'INDEPENDENT_VARIABLE_2_HERE'` with the column header name for your second independent variable.

e. If you have more than two independent variables, add them to the list inside the brackets, putting each one in quotation marks and separating them with commas. For example, four independent variables would look like this:



```
# independent_variables = ['HEADER_NAME_1', 'HEADER_NAME_2', 'HEADER_NAME_3', 'HEADER_NAME_4']
```

You can include as many independent variables as you'd like.



**It is better to copy and paste** because if any column header is not entered **exactly** as it appears in the csv you uploaded, it will not run.

**Only NUMERIC values** accepted for both **independent and dependent variables.**

## 3. Run CELL #6

In [0]:
# CELL 6

dependent_variable = 'DEPENDENT_VARIABLE_HERE'

independent_variables = ['INDEPEDENT_VARIABLE_1_HERE', 'INDEPENDENT_VARIABLE_2_HERE']

lin_reg(df, independent_variables, dependent_variable)
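
As a rough illustration of how to read the output, the sketch below fits the same kind of model to made-up data in which the dependent variable is exactly `2 * hours + 1`. The column names `hours` and `score` are assumptions, not part of the notebook.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data where score = 2 * hours + 1 exactly.
example_df = pd.DataFrame({"hours": [1.0, 2.0, 3.0, 4.0, 5.0]})
example_df["score"] = 2 * example_df["hours"] + 1

# Fit on a list of feature columns, the same way lin_reg does.
features = ["hours"]
model = LinearRegression().fit(example_df[features], example_df["score"])

slope = model.coef_[0]        # one coefficient per independent variable
intercept = model.intercept_  # predicted value when every feature is 0
r_squared = model.score(example_df[features], example_df["score"])
print(slope, intercept, r_squared)
```

Because the data lie exactly on a line, the coefficient is approximately 2, the intercept approximately 1, and R^2 approximately 1. Note that `lin_reg` computes its error metrics on the same rows it was fitted on, so they describe fit to your data rather than performance on new data.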

# For a linear regression graph with one independent and one dependent variable:
1. Fill in x_axis by replacing `'INDEPENDENT_VARIABLE_HERE'` with the column header for your independent variable.
2. Fill in y_axis by replacing `'DEPENDENT_VARIABLE_HERE'` with the column header for your dependent variable.
3. Run CELL #7

You will get an interactive graph. Hover over the line to see the formula for your ordinary least squares line and your r^2. Hover over the dots or along the line to see the observed or predicted value for a given point.

Use the camera-looking icon that appears when you hover over the top right-hand corner to download a png of your graph.

In [0]:
# CELL 7

# The trendline regresses y on x, so the independent variable goes on the x-axis.
x_axis = 'INDEPENDENT_VARIABLE_HERE'

y_axis = 'DEPENDENT_VARIABLE_HERE'

px.scatter(df, x=x_axis, y=y_axis, trendline='ols')
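
If you want the trendline's formula and r^2 as numbers rather than hover text, one option is `scipy.stats.linregress`. This is a separate sketch on made-up values, not the calculation the graph itself performs.

```python
from scipy import stats

# Made-up values for one independent (x) and one dependent (y) variable.
x_values = [1.0, 2.0, 3.0, 4.0, 5.0]
y_values = [2.1, 3.9, 6.2, 7.8, 10.1]

# Ordinary least squares fit of y on x.
result = stats.linregress(x_values, y_values)

print(f"y = {result.slope:.3f} * x + {result.intercept:.3f}")
print(f"r^2 = {result.rvalue ** 2:.3f}")
```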

# To Run Chi^2

## 1. Run CELL #8 below. 
(It will seem like nothing happened when you run CELL #8, but don't worry: important things happened behind the scenes. If a number appears where there was previously a blank [   ], you know it ran correctly.)

In [0]:
# CELL 8

def Chi2_loop(df, dependent_var):
  """Run a chi^2 test of independence between dependent_var and every other column."""
  columns = df.columns.tolist()
  columns.remove(dependent_var)
  output = []

  for column in columns:
    # Contingency table of counts: dependent variable (rows) by this column (columns).
    crosstab = pd.crosstab(df[dependent_var], df[column])
    chi2, p_value, dof, expected = stats.chi2_contingency(crosstab.values)
    output.append([column, chi2, p_value])

  df2 = pd.DataFrame(output,
                  columns=['Independent Variable', 'Chi^2', 'P-Value'])

  # Smallest p-values first.
  return df2.sort_values(by=['P-Value'])

## 2. Fill in your dependent variable.
- Replace `'DEPENDENT_VARIABLE'` with the column header of your dependent variable. Every other column in your csv will be tested against it.

**It is better to copy and paste** because if any column header is not entered **exactly** as it appears in the csv you uploaded, it will not run.

## 3. Run CELL #9.

In [0]:
# CELL 9

Chi2_loop(df, 'DEPENDENT_VARIABLE')

| | Independent Variable | Chi^2 | P-Value |
|---|---|---|---|
| 3 | PetalWidthCm | 271.75 | 2.16481e-35 |
| 2 | PetalLengthCm | 271.8 | 1.177567e-21 |
| 0 | SepalLengthCm | 156.266667 | 6.665987e-09 |
| 1 | SepalWidthCm | 88.364469 | 8.303948e-05 |
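
To make the calculation concrete, here is a minimal sketch of a single chi^2 test on made-up categorical data. The column names `group` and `answer` are assumptions for illustration.

```python
import pandas as pd
from scipy import stats

# Made-up categorical data: two groups of ten, each answering yes or no.
example_df = pd.DataFrame({
    "group": ["a"] * 10 + ["b"] * 10,
    "answer": ["yes"] * 8 + ["no"] * 2 + ["yes"] * 3 + ["no"] * 7,
})

# Contingency table of counts, as built inside Chi2_loop.
table = pd.crosstab(example_df["group"], example_df["answer"])
print(table)

# chi2_contingency returns the statistic, the p-value, the degrees of
# freedom, and the counts expected if the two variables were independent.
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(chi2, p_value, dof)
print(expected)
```

The test is intended for categorical columns. For a 2x2 table like this one, `stats.chi2_contingency` applies Yates' continuity correction by default; passing `correction=False` turns it off.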


# Thanks for using our basic statistics colab notebook! If you have any issues, comments, or requests for features, please email Katie at katie@ameliormate.com