# Introduction to statistical testing

This afternoon we will apply Student's and Welch's t-tests to the Pauline epistles, in order to assess whether conjunctions are used differently in the Epistle to the Colossians and in the "authentic" epistles, which would point to a **different authorship**.

Requirements for the lab, which should be installed in your virtual environment:
- `scipy`
- `pandas`
- `scikit-learn`
- `seaborn`

## Data pre-processing

The dataset `pauline.csv` contains the **lemmatized** text of the authentic epistles and of Colossians, split into chunks of 50 tokens. It has three columns: `book`, `chunk` (the index of the 50-token chunk), and `text`.

The goal of the pre-processing is to restructure the dataset so that it has the following columns:
`book`, `chunk` (the index of the 50-token chunk), and `functional_freq`.

The words that can be considered functional words are stored in `STOP_WORDS`:

In [59]:
STOP_WORDS = set([
    "ἄλλος", "ἄν", "ἄρα", "ἀλλ'", "ἀλλά", "ἀπό", "αὐτός", "δ'", "δαί", "δαίς", "δέ", "δή",
    "διά", "ἑαυτοῦ", "ἔτι", "ἐάν", "ἐγώ", "ἐκ", "ἐμός", "ἐν", "ἐπί", "εἰ", "εἰμί", "εἶμι",
    "εἰς", "γάρ", "γὰ", "γε", "ἡ", "ἦ", "καί", "κατά", "μέν", "μετά", "μή", "ὁ", "ὅδε",
    "ὅς", "ὅστις", "ὅτι", "οἱ", "οὕτως", "οὗτος", "οὐ", "οὔτε", "οὖν", "οὐδέ", "οὐδείς",
    "οὐκ", "παρά", "περί", "πρός", "σός", "σύ", "σύν", "τά", "τε", "τήν", "τῆς", "τῇ",
    "τί", "τί", "τίς", "τις", "τό", "τόν", "τοί", "τοιοῦτος", "τούς", "τοῦ", "τῶν", "τῷ",
    "ὑμός", "ὑπέρ", "ὑπό", "ὥστε", "ὡς", "ὦ"
])

**Exercise**:
1. Load the dataset in `pauline.csv` into a pandas DataFrame `pauline`.
2. On the column `text`, compute the counts of the words contained in `STOP_WORDS` (hint: use the parameter `vocabulary` of the class `CountVectorizer` that we used on Day 3).
3. Sum these word counts for each chunk to obtain the overall functional word frequency, and store the result in a numpy array `word_freq`.
4. Using the method `assign`, add `word_freq` to the DataFrame `pauline` as a new column `functional_freq`.
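
The counting and summing steps can be sketched on a toy example. This is only an illustration under stated assumptions, not the lab solution: the DataFrame `toy`, its values and the reduced vocabulary `toy_vocab` are hypothetical, and the `token_pattern` override assumes that the lemmatized tokens are separated by whitespace (the default pattern of `CountVectorizer` ignores one-character tokens such as "ὁ").

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the lab data (hypothetical values, not taken from pauline.csv)
toy = pd.DataFrame({
    "book": ["A", "A", "B"],
    "chunk": [0, 1, 0],
    "text": ["καί λόγος ἐν θεός καί", "ὁ λόγος εἰμί", "χάρις σύ καί εἰρήνη"],
})
toy_vocab = {"καί", "ἐν", "ὁ", "εἰμί", "σύ"}  # small stand-in for STOP_WORDS

# Assumption: tokens are whitespace-separated, so any run of non-space
# characters is treated as one token; lowercasing is disabled to leave
# the Greek text untouched.
vectorizer = CountVectorizer(
    vocabulary=sorted(toy_vocab),
    token_pattern=r"\S+",
    lowercase=False,
)
counts = vectorizer.transform(toy["text"])  # sparse matrix, one row per chunk

# Summing each row gives the number of functional words in that chunk
word_freq = np.asarray(counts.sum(axis=1)).ravel()
toy = toy.assign(functional_freq=word_freq)
print(toy[["book", "chunk", "functional_freq"]])
```

Because every chunk has the same length, the raw count per chunk is proportional to the relative frequency, so either can be compared across books.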

## Data analysis

Before performing statistical tests, we can check the distribution of the data both visually and statistically, using the skills that we developed on Day 2.

**Exercise**:
1. Using a `pandas` filter, create a new variable called `is_col` that equals 1 if the chunk belongs to the book `Col` and 0 otherwise.
2. Using the method `assign`, add this new variable to the dataset `pauline` (you can for example call the column `is_col`).
3. Give the main summary statistics (mean, median, standard error) of the functional word frequencies for Colossians and for the authentic epistles.
4. Plot the distribution of functional word frequencies for Colossians and for the authentic epistles, using a plot suited to this type of data.
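
As a rough illustration of the flagging and summary steps, here is a minimal sketch on a hypothetical toy DataFrame (the book labels other than `Col` and all frequency values are made up for the example):

```python
import pandas as pd

# Hypothetical toy values standing in for the real chunk frequencies
toy = pd.DataFrame({
    "book": ["Col", "Col", "Rom", "Rom", "Gal"],
    "functional_freq": [20, 24, 18, 22, 26],
})

# Boolean comparison cast to int: 1 for Colossians, 0 otherwise
is_col = (toy["book"] == "Col").astype(int)
toy = toy.assign(is_col=is_col)

# Mean, median and standard error of the mean for each group
summary = toy.groupby("is_col")["functional_freq"].agg(["mean", "median", "sem"])
print(summary)
```

For the plot, `seaborn.boxplot` or `seaborn.histplot` with `is_col` as the grouping variable are two possible options.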

## Performing statistical tests

Student's t-tests are available in `scipy` through the function `ttest_ind` of the module `stats`.

In [53]:
from scipy.stats import ttest_ind

It provides several options (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html#scipy.stats.ttest_ind), the most important for us being the `equal_var` parameter: when set to `True` (the default), the function performs a standard Student's t-test; when set to `False`, it performs Welch's t-test. In this lab, we will use Welch's test, as detailed in the lecture.
The function returns the test statistic (`t_stat`) and the p-value (`p_val`).

In [57]:
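# Example using randomly generated artificial samples
import numpy as np

# Generate two normally distributed samples with different means (0 and 1)
rng = np.random.default_rng(42)  # fixed seed so that the results are reproducible
sample_1 = rng.normal(0, 1, 1000)
sample_2 = rng.normal(1, 1, 1000)

# Perform Welch's t-test (equal_var=False); the default would be Student's test
t_stat, p_val = ttest_ind(sample_1, sample_2, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_val:.3g}")

# Can you conclude that the two samples come from different distributions?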
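
To make the effect of `equal_var` concrete, the sketch below recomputes Welch's statistic by hand, $t = (\bar{x}_1 - \bar{x}_2) / \sqrt{s_1^2/n_1 + s_2^2/n_2}$, and compares it with the output of `ttest_ind`. The samples, their parameters and the seed are arbitrary choices made for the illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)   # arbitrary fixed seed
a = rng.normal(0, 1, 200)
b = rng.normal(0.5, 3, 50)       # different variance and sample size

# Welch's statistic: difference of means over the unpooled standard error
se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
t_manual = (a.mean() - b.mean()) / se

t_welch, p_welch = ttest_ind(a, b, equal_var=False)    # Welch
t_student, p_student = ttest_ind(a, b, equal_var=True)  # Student (pooled variance)

print(f"manual Welch t = {t_manual:.3f}, scipy Welch t = {t_welch:.3f}")
print(f"Student t = {t_student:.3f}")
```

When the two groups differ in both size and variance, as here, the pooled and unpooled standard errors differ, so the two tests return different statistics and p-values.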

**Exercise**:
1. Using the variable `is_col`, extract as two `numpy` arrays the frequencies for the Epistle to the Colossians and for the authentic epistles (using the attribute `.values` of pandas Series).
2. Perform the t-test on the two arrays (Welch's variant, `equal_var=False`).
3. Conclude on whether functional words are used differently in Colossians and in the authentic epistles.
4. **Bonus**: Test the normality hypothesis using Shapiro's test (`scipy.stats.shapiro`).
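
For the bonus, `scipy.stats.shapiro` takes a single sample and returns the test statistic and a p-value; a small p-value is evidence against normality. A minimal sketch on artificial data (the two samples and the seed are arbitrary choices, not the lab data):

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)             # arbitrary fixed seed
normal_sample = rng.normal(0, 1, 200)      # drawn from a normal distribution
skewed_sample = rng.exponential(1.0, 200)  # clearly non-normal (right-skewed)

stat_n, p_n = shapiro(normal_sample)
stat_s, p_s = shapiro(skewed_sample)

print(f"normal sample: W = {stat_n:.3f}, p = {p_n:.3g}")
print(f"skewed sample: W = {stat_s:.3f}, p = {p_s:.3g}")
```

In the lab, the test would be applied separately to each of the two frequency arrays.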


### Bonus: linear regression and statistical testing

Go to `lab_answers.ipynb`.