`pandas` can import data from a variety of formats, including Excel workbooks via the `read_excel()` function. Let's read in `contestants.xlsx` as `contestants`.

## Importing the data

In [1]:
import pandas as pd
contestants = pd.read_excel('contestants.xlsx')

To preview the data, use the `head()` method, which prints just the first five rows by default.

In [2]:
contestants.head()

Unnamed: 0,EMAIL,COHORT,PRE,POST,AGE,SEX,EDUCATION,SATISFACTION,STUDY_HOURS
0,smehaffey0@creativecommons.org,4,485,494,32,Male,Bachelor's,2,36.6
1,dbateman1@hao12@.com,4,462,458,33,Female,Bachelor's,8,22.4
2,bbenham2@xrea.com,3,477,483,33,Female,Bachelor's,1,19.8
3,mwison@@g.co,2,480,488,31,Female,Bachelor's,10,33.1
4,jagostini4@wordpress.org,1,495,494,38,Female,Bachelor's,9,32.5


## Cleaning up the columns 

A data analysis/transformation program worth its salt will be just as good at working with your _metadata_ as it is with your data! `pandas` is particularly well suited for that.

I find it much easier to work with columns when they are all named in lowercase. We can do that with one line of code, like so:

In [3]:
# Convert headers to all lowercase

contestants.columns = contestants.columns.str.lower()
contestants.head()

Unnamed: 0,email,cohort,pre,post,age,sex,education,satisfaction,study_hours
0,smehaffey0@creativecommons.org,4,485,494,32,Male,Bachelor's,2,36.6
1,dbateman1@hao12@.com,4,462,458,33,Female,Bachelor's,8,22.4
2,bbenham2@xrea.com,3,477,483,33,Female,Bachelor's,1,19.8
3,mwison@@g.co,2,480,488,31,Female,Bachelor's,10,33.1
4,jagostini4@wordpress.org,1,495,494,38,Female,Bachelor's,9,32.5
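The same `.str` accessor can chain further header cleanup. The sketch below uses a small made-up DataFrame (its column names are hypothetical and do not come from `contestants.xlsx`) to trim stray whitespace and replace spaces with underscores, a common follow-up when headers come from a hand-edited workbook.

```python
import pandas as pd

# Hypothetical headers, invented for illustration only
messy = pd.DataFrame({' Study Hours ': [36.6], 'EMAIL': ['a@b.com'], 'Pre Score': [485]})

# Chain string methods on the column index:
# trim whitespace, lowercase, then swap spaces for underscores
messy.columns = (
    messy.columns
    .str.strip()
    .str.lower()
    .str.replace(' ', '_', regex=False)
)

print(list(messy.columns))
```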


## Pattern matching/regular expressions

The `contestants` DataFrame has a column called `email` consisting of email addresses for each participant in the contest. We have been asked to go through this column and drop any rows that contain an invalid email address.

Searching through text to match patterns is done with a group of tools called _regular expressions_. While Power Query does have some basic capabilities for working with text such as converting cases, it does not have the ability to search for patterns of text like Python can. 

Regular expressions are notoriously hard to build and validate, so don't worry too much about how the expression below was actually created. There is always the internet to help figure it out!

We will go ahead and define it here:

In [4]:
# Define a regular expression pattern for valid email addresses
email_pattern = r'^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}$'  
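For readers who do want to peek inside the pattern, here is a piece-by-piece reading, checked against a few addresses with Python's built-in `re` module. Three of the sample addresses come from the preview above; `name@site.info` is a hypothetical address added to show a limitation.

```python
import re

email_pattern = r'^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}$'

# ^            start of the string
# [a-z0-9]+    one or more lowercase letters or digits
# [\._]?       an optional single dot or underscore
# [a-z0-9]+    one or more lowercase letters or digits
# [@]          exactly one @ sign
# \w+          the domain name (letters, digits, underscores)
# [.]          a literal dot
# \w{2,3}      a two- or three-character top-level domain
# $            end of the string

samples = [
    'smehaffey0@creativecommons.org',  # from the preview above
    'dbateman1@hao12@.com',            # second @ where the dot should be
    'mwison@@g.co',                    # two @ signs in a row
    'name@site.info',                  # hypothetical: four-letter top-level domain
]

results = {s: bool(re.search(email_pattern, s)) for s in samples}
for address, ok in results.items():
    print(address, ok)
```

Only the first address passes. The last one is rejected even though it looks reasonable, so this pattern is best read as a strict screening rule rather than a complete email validator.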

Next, we can use the `str.contains()` method to keep only the records that match the pattern.

In [5]:
full_emails = contestants[contestants['email'].str.contains(email_pattern)]
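A variation on this filter, sketched on toy data: storing the boolean mask in a variable makes it easy to look at the rejected rows as well as the accepted ones. The `na=False` argument is an addition not used in the cell above; it treats missing emails as non-matching, so the mask stays strictly True/False. The variable names here are made up for illustration.

```python
import pandas as pd

email_pattern = r'^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}$'

# Toy data: two addresses from the preview above plus a missing value
toy = pd.DataFrame({'email': ['bbenham2@xrea.com', 'mwison@@g.co', None]})

# na=False treats missing emails as non-matching instead of returning NaN
is_valid = toy['email'].str.contains(email_pattern, na=False)

valid_rows = toy[is_valid]      # rows that match the pattern
rejected_rows = toy[~is_valid]  # rows that do not, kept aside for review

print(valid_rows)
print(rejected_rows)
```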

To confirm how many rows have been filtered out, we can compare the `shape` attribute of the two DataFrames:

In [ ]:
contestants.shape

In [ ]:
full_emails.shape
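Because `shape` is a `(rows, columns)` tuple, the number of dropped rows can also be computed directly. A minimal sketch with toy stand-ins for the two DataFrames (`before` and `after` are hypothetical names):

```python
import pandas as pd

# Toy stand-ins for the original and filtered DataFrames
before = pd.DataFrame({'email': ['bbenham2@xrea.com', 'mwison@@g.co', 'jagostini4@wordpress.org']})
after = before.iloc[[0, 2]]

# shape is a (rows, columns) tuple, so index 0 is the row count
rows_dropped = before.shape[0] - after.shape[0]
print(before.shape, after.shape, rows_dropped)
```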

## Analyzing missing values 

A more thorough way to make sense of what's going on with the number of rows in the DataFrame is to use the `info()` method. 

In [6]:
full_emails.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 82 entries, 0 to 99
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   email         82 non-null     object 
 1   cohort        82 non-null     int64  
 2   pre           82 non-null     int64  
 3   post          82 non-null     int64  
 4   age           82 non-null     int64  
 5   sex           82 non-null     object 
 6   education     82 non-null     object 
 7   satisfaction  82 non-null     int64  
 8   study_hours   82 non-null     float64
dtypes: float64(1), int64(5), object(3)
memory usage: 6.4+ KB


Here you will see mention of `null`. Excel has no direct equivalent (the closest thing is a blank cell), but the concept exists in `pandas` and in many programming languages. It simply means a missing value.
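To make this concrete, here is a small sketch on made-up data. In `pandas`, missing entries are typically displayed as `NaN` or `None`, and `isnull()` flags them with `True`.

```python
import pandas as pd

# Toy data with one missing number and one missing string
toy = pd.DataFrame({
    'age': [32, None, 38],
    'sex': ['Male', 'Female', None],
})

print(toy)
print(toy.isnull())        # True wherever a value is missing
print(toy.isnull().sum())  # count of missing values per column
```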

In [7]:
# What fraction of each column is missing?
full_emails.isnull().mean()


In [8]:
# Sort by percentage
full_emails.isnull().mean().sort_values(ascending=False)

email           0.0
cohort          0.0
pre             0.0
post            0.0
age             0.0
sex             0.0
education       0.0
satisfaction    0.0
study_hours     0.0
dtype: float64
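Why does `mean()` give the share of missing values? `isnull()` returns True/False, and `pandas` counts True as 1 and False as 0 when averaging. Since every column above is complete, the sketch below uses toy data with one gap to show a non-zero result.

```python
import pandas as pd

toy = pd.DataFrame({
    'pre': [485, None, 477, 480],
    'post': [494, 458, 483, 488],
})

# True counts as 1 and False as 0, so the mean is the fraction missing
fraction_missing = toy.isnull().mean()

# Multiply by 100 for a percentage, then sort worst-first
percent_missing = (fraction_missing * 100).sort_values(ascending=False)
print(percent_missing)
```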

In [9]:
# Drop missing values
complete_cases = full_emails.dropna()
complete_cases.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 82 entries, 0 to 99
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   email         82 non-null     object 
 1   cohort        82 non-null     int64  
 2   pre           82 non-null     int64  
 3   post          82 non-null     int64  
 4   age           82 non-null     int64  
 5   sex           82 non-null     object 
 6   education     82 non-null     object 
 7   satisfaction  82 non-null     int64  
 8   study_hours   82 non-null     float64
dtypes: float64(1), int64(5), object(3)
memory usage: 6.4+ KB
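With no missing values, `dropna()` leaves all 82 rows in place. On data that does have gaps, the default behavior drops a row if any column is missing, which can be stricter than needed. A sketch on toy data, using the `subset` argument to require only the columns an analysis depends on:

```python
import pandas as pd

toy = pd.DataFrame({
    'email': ['bbenham2@xrea.com', 'jagostini4@wordpress.org', 'smehaffey0@creativecommons.org'],
    'pre': [477, None, 485],
    'satisfaction': [1, 9, None],
})

# Default: drop a row if any column is missing
strict = toy.dropna()

# Only require the columns the analysis needs
needs_pre = toy.dropna(subset=['pre'])

print(len(strict), len(needs_pre))
```

Both calls return a new DataFrame and leave `toy` unchanged.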


In [10]:
# Write the cleaned data to a new Excel workbook, leaving out the index column
complete_cases.to_excel('contestants_cleaned.xlsx', index=False)