# McNemar Chi-Square in Python
You will now learn how to conduct McNemar Chi-Squares in Python. The process is much the same as in R, but the output leaves something to be desired.

## Load in Packages
In order to run McNemar Chi-Squares in Python, you will need `pandas` to read in your data, and `statsmodels` to analyze it:

In [1]:
import pandas as pd
import statsmodels as sm
from statsmodels.stats.contingency_tables import mcnemar

---
## Load in Data

In [5]:
bakery = pd.read_csv('./assets/bakery_sales.csv')

In [6]:
bakery.head(2)

Unnamed: 0,Date,Time,Transaction,Item
0,10/30/2016,9:58:11 AM,1,Bread
1,10/30/2016,10:05:34 AM,2,Scandinavian


---
## Question Set Up
You will be answering the following question:<br>
`Do the sales of coffee change from the beginning of the month to the end of the month?`

---
## Data Wrangling
Just like with R, you will need to do some data wrangling.

#### Separating the pieces of the Date Variable
The first order of business is to separate out your `Date` column. You can do this with the function `str.split()`:

In [7]:
bakery1 = bakery['Date'].str.split('/', expand=True).rename(columns = lambda x: "Date" + str(x +1))

And then of course you'll need to put your data back together again:

In [8]:
bakery3 = pd.concat([bakery, bakery1], axis=1)
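To see what the split, rename, and concat steps do, here is a minimal sketch on a three-row toy frame. The rows are made up for illustration and are not taken from `bakery_sales.csv`; the names `toy`, `parts`, and `toy_split` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the bakery data; these three rows are made up for illustration.
toy = pd.DataFrame({"Date": ["10/30/2016", "11/02/2016", "11/16/2016"]})

# expand=True returns one column per piece, labeled 0, 1, 2;
# the lambda renames them to Date1 (month), Date2 (day), Date3 (year).
parts = toy["Date"].str.split("/", expand=True).rename(columns=lambda x: "Date" + str(x + 1))

# axis=1 places the new columns side by side with the originals, matching rows by index.
toy_split = pd.concat([toy, parts], axis=1)
print(toy_split)
```

Note that the pieces are still strings at this point (`"02"`, not `2`), which is why a type conversion is needed before any numeric comparison.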

#### Changing Day to an Integer
Next you'll need to recode the `Date2` variable so that it provides information about beginning or ending of the month. To do this, your `Date2` variable will need to be an integer. You can double check that it is with the function `info()`:

In [9]:
bakery3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21293 entries, 0 to 21292
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Date         21293 non-null  object
 1   Time         21293 non-null  object
 2   Transaction  21293 non-null  int64 
 3   Item         21293 non-null  object
 4   Date1        21293 non-null  object
 5   Date2        21293 non-null  object
 6   Date3        21293 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


So it looks like `Date2` is currently string data (pandas reports strings as `object`), which is common after using `str.split()` - after all, it literally stands for "string split!" However, this is an easy fix - you can use `astype(int)`:

In [10]:
bakery3.Date2 = bakery3.Date2.astype(int)

And now if you run `info()` again, you will find that `Date2` is now an integer!

In [11]:
bakery3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21293 entries, 0 to 21292
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Date         21293 non-null  object
 1   Time         21293 non-null  object
 2   Transaction  21293 non-null  int64 
 3   Item         21293 non-null  object
 4   Date1        21293 non-null  object
 5   Date2        21293 non-null  int64 
 6   Date3        21293 non-null  object
dtypes: int64(2), object(5)
memory usage: 1.1+ MB
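As an alternative sketch, the day of the month can also be pulled out as an integer in one step by parsing the dates with `pd.to_datetime()`. The toy rows and the `Day` column name are hypothetical, and the `%m/%d/%Y` format is an assumption based on the two rows shown by `bakery.head(2)`.

```python
import pandas as pd

# Made-up rows in the same month/day/year layout as the Date column.
toy = pd.DataFrame({"Date": ["10/30/2016", "11/02/2016", "11/16/2016"]})

# Parse the strings into datetimes; the format string is an assumption.
parsed = pd.to_datetime(toy["Date"], format="%m/%d/%Y")

# .dt.day is already an integer column, so no astype(int) step is needed.
toy["Day"] = parsed.dt.day
print(toy.dtypes)
```

This route skips the split and concat steps, at the cost of having to know the date format up front.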


#### Recoding to beginning or end of Month
Now that your variable is numeric, you can recode it using the greater than and less than operators. You can recode using a function with some if statements, and then apply that function to your data:

In [12]:
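def month(day):
    # Days 1-15 count as the beginning of the month
    if day <= 15:
        return 0
    # Days 17-31 count as the end of the month. Day 16 matches neither
    # test, so it comes back as None (NaN) and is left out of the crosstab;
    # change the test to `> 15` if you want day 16 included.
    if day > 16:
        return 1
bakery3['DayR'] = bakery3["Date2"].apply(month)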
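The same kind of recode can be sketched without a helper function by comparing the whole column at once. This version assumes the intended split is days 1-15 versus days 16-31, which is an assumption: the function above leaves day 16 unassigned. The names `days` and `day_r` are hypothetical.

```python
import pandas as pd

# A few made-up day-of-month values, including the boundary days.
days = pd.Series([1, 15, 16, 17, 31], name="Date2")

# The comparison gives True/False for every row at once; astype(int) maps that to 1/0.
# Cutoff assumption: days 1-15 are "beginning" (0), days 16-31 are "end" (1).
day_r = (days > 15).astype(int)
print(day_r.tolist())
```

Because every value is either greater than 15 or not, no row can fall through the gaps and end up missing.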

#### Recoding to `Coffee` or `Other`
Next, you will recode the `Item` variable into something that is Coffee or Not Coffee. You will use the same format as the recode above:

In [13]:
def item(value):
    # 1 for Coffee, 0 for every other item
    if value == "Coffee":
        return 1
    return 0
bakery3['CoffeeYN'] = bakery3["Item"].apply(item)

#### Make a Contingency Table
Next, you will need to make a contingency table, since the function for McNemar Chi-Squares in Python will not accept raw data. Happily, the `pd.crosstab()` function you learned earlier will do this job easily for you:

In [14]:
bakery_crosstab = pd.crosstab(bakery3['DayR'], bakery3['CoffeeYN'])
bakery_crosstab

CoffeeYN,0,1
DayR,,
0.0,8238,2841
1.0,7126,2491
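To make the layout of the crosstab concrete, here is a small sketch on made-up data showing which variable ends up on the rows, which on the columns, and how to pull out a single cell. The seven toy rows and the names `table`, `early_coffee`, and `late_other` are hypothetical.

```python
import pandas as pd

# Made-up recoded data: 3 rows from early in the month, 4 from late.
toy = pd.DataFrame({
    "DayR":     [0, 0, 0, 1, 1, 1, 1],
    "CoffeeYN": [1, 0, 0, 1, 1, 0, 1],
})

# Rows come from the first argument, columns from the second.
table = pd.crosstab(toy["DayR"], toy["CoffeeYN"])
print(table)

# .loc[row_label, column_label] pulls out a single cell count.
early_coffee = table.loc[0, 1]   # DayR == 0 and CoffeeYN == 1
late_other = table.loc[1, 0]     # DayR == 1 and CoffeeYN == 0
print(early_coffee, late_other)
```

These two off-diagonal cells are the ones the McNemar statistic is built from, so it helps to know where they sit in the table.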


---
## Test assumptions and run analysis
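The `mcnemar()` function does not check the assumption of at least five expected cases per cell for you, so you will need to check it yourself before trusting the result, which matters most if you are analyzing high stakes data. The expected count for a cell is its row total times its column total, divided by the grand total; `scipy.stats.chi2_contingency()` also returns these expected counts.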
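One way to look at expected counts is to compute them by hand from the crosstab: each cell's expected count is its row total times its column total, divided by the grand total. The sketch below uses the four counts from the crosstab above and plain `numpy`; the variable names are hypothetical.

```python
import numpy as np

# Observed counts copied from the crosstab above
# (rows: DayR 0 and 1, columns: CoffeeYN 0 and 1).
observed = np.array([[8238, 2841],
                     [7126, 2491]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.sum()

# Expected count per cell under independence: row total * column total / n.
expected = np.outer(row_totals, col_totals) / n
print(expected.round(1))
print("All expected counts at least 5:", bool((expected >= 5).all()))
```

With thousands of transactions in every cell, the expected counts here are far above five.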

You will use the function `sm.stats.contingency_tables.mcnemar()` to run your McNemar Chi-Square. It takes the crosstab you just created as its first argument, plus `exact=`, which you can set to `False` to use the chi-square distribution rather than an exact binomial test, and `correction=`, which you will set to `True` to apply a continuity correction.

In [15]:
result = sm.stats.contingency_tables.mcnemar(bakery_crosstab, exact=False, correction=True)

If you just run the code above, you may end up confused - nothing happened! That's because the output was stored in `result` rather than displayed, so you need to print it yourself:

In [16]:
print(result)

pvalue      0.0
statistic   1841.3420286946925
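To see where that statistic comes from, it can be reproduced by hand. With the continuity correction, the McNemar chi-square uses only the two off-diagonal cells of the table: (|b - c| - 1)^2 / (b + c), compared against a chi-square distribution with 1 degree of freedom. The sketch below is a hand calculation using only the standard library, not the statsmodels code itself; `b` and `c` are hypothetical names for the two cells.

```python
import math

# Off-diagonal cells of the crosstab above.
b = 2841  # DayR = 0, CoffeeYN = 1
c = 7126  # DayR = 1, CoffeeYN = 0

# McNemar chi-square with continuity correction (the correction=True case).
statistic = (abs(b - c) - 1) ** 2 / (b + c)

# Upper tail of a chi-square distribution with 1 degree of freedom,
# written with the complementary error function.
pvalue = math.erfc(math.sqrt(statistic / 2))

print(statistic)
print(pvalue)
```

The statistic comes out at about 1841.34, in line with the printed result, and the p value is too small to represent as anything other than zero.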


---
## Interpret Results
Alright! You now have results, and they are significant - the p value is less than .05, so it looks like different amounts of coffee are sold at the beginning and end of the month! How does it differ? The `mcnemar()` output won't tell you: it reports only the test statistic and p value, not standardized residuals, so any post hoc look at individual cells has to be done as a separate step.
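If you want standardized residuals anyway, one option is to compute them by hand from the crosstab. The sketch below uses the usual adjusted-residual formula for a contingency table, (observed - expected) divided by its estimated standard error under independence; treat it as an illustration under that assumption rather than as output from `mcnemar()`. The variable names are hypothetical. The `Table` class in `statsmodels.stats.contingency_tables` also documents a `standardized_resids` attribute that may be worth comparing against.

```python
import numpy as np

# Observed counts copied from the crosstab above.
observed = np.array([[8238, 2841],
                     [7126, 2491]])

n = observed.sum()
row_prop = observed.sum(axis=1, keepdims=True) / n   # shape (2, 1)
col_prop = observed.sum(axis=0, keepdims=True) / n   # shape (1, 2)
expected = row_prop * col_prop * n

# Adjusted (standardized) residual: (observed - expected) divided by its
# estimated standard error under independence.
std_resid = (observed - expected) / np.sqrt(expected * (1 - row_prop) * (1 - col_prop))
print(std_resid.round(2))
```

A common rule of thumb is to flag cells whose standardized residual is beyond roughly plus or minus 2.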