## Importing the data

The `pandas` library can import data from many formats, including Excel workbooks via the `read_excel()` function. Let's read the `contestants.xlsx` file into a DataFrame named `contestants`.

In [1]:
import pandas as pd
contestants = pd.read_excel('contestants.xlsx')

A `pandas` DataFrame may have thousands of rows or more, so printing them all is impractical. However, it is important to visually inspect the data -- a benefit Excel users are familiar with. To get a glimpse of the data and ensure it meets our expectations, we can use the `head()` method to display the first five rows.

In [2]:
contestants.head()

| | EMAIL | COHORT | PRE | POST | AGE | SEX | EDUCATION | SATISFACTION | STUDY_HOURS |
|---|---|---|---|---|---|---|---|---|---|
| 0 | smehaffey0@creativecommons.org | 4 | 485 | 494 | 32.0 | Male | Bachelor's | 2 | 36.6 |
| 1 | dbateman1@hao12@.com | 4 | 462 | 458 | 33.0 | Female | Bachelor's | 8 | 22.4 |
| 2 | bbenham2@xrea.com | 3 | 477 | 483 | NaN | Female | Bachelor's | 1 | 19.8 |
| 3 | mwison@@g.co | 2 | 480 | 488 | 31.0 | Female | Bachelor's | 10 | 33.1 |
| 4 | jagostini4@wordpress.org | 1 | 495 | 494 | 38.0 | Female | NaN | 9 | 32.5 |
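`head()` also accepts a row count, and `tail()` and `sample()` give other quick views of the data. Here is a minimal sketch on a small made-up DataFrame (`demo` and its values are hypothetical stand-ins, not the contest data):

```python
import pandas as pd

# Hypothetical stand-in data; the real workbook has 100 rows
demo = pd.DataFrame({
    "cohort": [4, 4, 3, 2, 1, 2, 3],
    "pre": [485, 462, 477, 480, 495, 470, 466],
})

first_three = demo.head(3)                      # first three rows
last_two = demo.tail(2)                         # last two rows
random_two = demo.sample(n=2, random_state=0)   # two random rows, reproducible

print(first_three)
print(last_two)
print(random_two)
```

Looking at the end of the data or at a random sample can surface problems that the first five rows happen to hide.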


This preview reveals a few issues that need to be addressed. First, some of the emails have an invalid format. Second, the `AGE` and `EDUCATION` columns contain `NaN` values, which indicate missing data. We can address these and other issues in ways that would be difficult or impossible with Excel's built-in features.

## Working with the metadata

A good data analysis and transformation tool should handle metadata, such as column names and data types, as capably as the data itself. `pandas` does this well.

Currently, our DataFrame has column names in uppercase. To make typing column names easier, I prefer using lowercase names. Fortunately, in `pandas`, we can accomplish this with a single line of code: 

In [3]:
# Convert headers to all lowercase
contestants.columns = contestants.columns.str.lower()
contestants.head()

| | email | cohort | pre | post | age | sex | education | satisfaction | study_hours |
|---|---|---|---|---|---|---|---|---|---|
| 0 | smehaffey0@creativecommons.org | 4 | 485 | 494 | 32.0 | Male | Bachelor's | 2 | 36.6 |
| 1 | dbateman1@hao12@.com | 4 | 462 | 458 | 33.0 | Female | Bachelor's | 8 | 22.4 |
| 2 | bbenham2@xrea.com | 3 | 477 | 483 | NaN | Female | Bachelor's | 1 | 19.8 |
| 3 | mwison@@g.co | 2 | 480 | 488 | 31.0 | Female | Bachelor's | 10 | 33.1 |
| 4 | jagostini4@wordpress.org | 1 | 495 | 494 | 38.0 | Female | NaN | 9 | 32.5 |
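Assigning to `.columns` changes the DataFrame in place. As an alternative sketch, `rename()` returns a new DataFrame and leaves the original alone, and the `.str` methods can be chained when headers are messier than simple uppercase. The `demo` and `messy` DataFrames below are hypothetical examples:

```python
import pandas as pd

# Hypothetical stand-in with uppercase headers
demo = pd.DataFrame({"EMAIL": ["a@b.com"], "STUDY_HOURS": [36.6]})

# rename() applies the function to every column name and returns a copy
lowered = demo.rename(columns=str.lower)

# Hypothetical messy header: stray spaces and mixed case
messy = pd.DataFrame({" Study Hours ": [36.6]})
messy.columns = (
    messy.columns.str.strip()                        # drop leading/trailing spaces
    .str.lower()                                     # lowercase
    .str.replace(" ", "_", regex=False)              # spaces to underscores
)

print(list(lowered.columns))
print(list(messy.columns))
```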


## Pattern matching/regular expressions

The `email` column of this DataFrame contains the email address of each contest participant. Our task is to remove any rows whose email address is invalid.

Text pattern matching like this is done with regular expressions. Power Query offers basic text manipulation, such as case conversion, but it has no built-in support for regular expressions, whereas Python does.

Regular expressions can be challenging to create and validate, but there are online resources available to assist with that. Here is the regular expression we will be using:

In [4]:
# Define a regular expression pattern for valid email addresses
email_pattern = r'^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}$'  
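Before applying a pattern to the whole column, it helps to try it on a handful of strings. The sketch below uses Python's built-in `re` module with three addresses from the preview plus three hypothetical ones added to probe the pattern's limits:

```python
import re

email_pattern = r'^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}$'

samples = [
    "smehaffey0@creativecommons.org",  # from the data preview
    "dbateman1@hao12@.com",            # from the data preview (second @)
    "mwison@@g.co",                    # from the data preview (double @)
    "first.last@example.com",          # hypothetical
    "First.Last@example.com",          # hypothetical: uppercase letters
    "someone@example.info",            # hypothetical: four-letter domain ending
]

# True where the whole string fits the pattern
results = {s: bool(re.search(email_pattern, s)) for s in samples}

for address, ok in results.items():
    print(f"{address:35} {ok}")
```

The last two hypothetical addresses are rejected even though they are plausible in practice: the pattern accepts only lowercase letters before the `@` and only two- or three-character endings. That is a deliberate simplification to keep in mind when reusing it on other data.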

Next, we can use the `str.contains()` method to keep only the records that match the pattern.

In [5]:
full_emails = contestants[contestants['email'].str.contains(email_pattern)]
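One caveat: if the `email` column contained missing values, `str.contains()` would return missing values for those rows, and filtering with that mask would typically raise an error. Our column appears to have none, but a defensive sketch passes `na=False` so that a missing email counts as a non-match. The `demo` DataFrame here is a hypothetical stand-in:

```python
import pandas as pd

email_pattern = r'^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}$'

# Hypothetical stand-in: one valid, one invalid, one missing email
demo = pd.DataFrame({"email": ["bbenham2@xrea.com", "mwison@@g.co", None]})

# na=False treats a missing email as "does not match"
mask = demo["email"].str.contains(email_pattern, na=False)
valid = demo[mask]

print(mask.tolist())
print(valid)
```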

To confirm how many rows have been filtered out, we can compare the `shape` attribute of the two DataFrames:

In [6]:
# Dimensions of original DataFrame
contestants.shape

(100, 9)

In [7]:
# Dimensions of DataFrame with valid emails
full_emails.shape

(82, 9)
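Beyond counting the rows that were filtered out, it can be worth looking at them to check that the pattern rejected what we expected. A minimal sketch, assuming a stand-in DataFrame (`demo`) built from four of the emails in the preview:

```python
import pandas as pd

email_pattern = r'^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}$'

# Stand-in built from four emails shown in the preview
demo = pd.DataFrame({"email": [
    "smehaffey0@creativecommons.org",
    "dbateman1@hao12@.com",
    "bbenham2@xrea.com",
    "mwison@@g.co",
]})

mask = demo["email"].str.contains(email_pattern)

rejected = demo[~mask]              # ~ inverts the mask: rows that did NOT match
n_removed = len(demo) - mask.sum()  # number of rows filtered out

print(rejected)
print(n_removed)
```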

## Analyzing missing values 

The `info()` method reports the DataFrame's dimensions along with each column's non-null count and data type:

In [8]:
full_emails.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 82 entries, 0 to 99
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   email         82 non-null     object 
 1   cohort        82 non-null     int64  
 2   pre           82 non-null     int64  
 3   post          82 non-null     int64  
 4   age           81 non-null     float64
 5   sex           82 non-null     object 
 6   education     81 non-null     object 
 7   satisfaction  82 non-null     int64  
 8   study_hours   82 non-null     float64
dtypes: float64(2), int64(4), object(3)
memory usage: 6.4+ KB


In many programming languages, `null` refers to a missing or undefined value; Python's built-in equivalent is `None`. In `pandas` DataFrames, a missing value is typically shown as `NaN`, which stands for "Not a Number."
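`NaN` behaves differently from ordinary values, which is why `pandas` provides dedicated functions for detecting it. A short sketch (the `ages` Series is a hypothetical example):

```python
import numpy as np
import pandas as pd

value = np.nan

# NaN is not equal to anything, including itself
is_equal_to_itself = (value == value)

# Use pd.isna() (or its alias pd.isnull()) to test for missing values
is_missing = pd.isna(value)
none_is_missing = pd.isna(None)

# A missing entry in an integer-like column forces a float dtype
ages = pd.Series([32, None, 31])

print(is_equal_to_itself, is_missing, none_is_missing)
print(ages.dtype)
```

That last point is why `age` is listed as `float64` in the `info()` output even though ages are whole numbers.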

While basic Excel lacks an exact equivalent to `null`, Power Query [does provide this value](https://stringfestanalytics.com/how-to-understand-null-and-missing-values-in-power-query/), which greatly aids in data management and inspection. However, it can be difficult to work with these missing values programmatically in Power Query. `pandas` makes this easier.

For example, I might want to see which columns have the highest proportion of missing values. This is easy with `pandas`:

In [9]:
# Proportion of missing values per column, largest first
full_emails.isnull().mean().sort_values(ascending=False)

age             0.012195
education       0.012195
email           0.000000
cohort          0.000000
pre             0.000000
post            0.000000
sex             0.000000
satisfaction    0.000000
study_hours     0.000000
dtype: float64
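The values above are proportions: 0.012195 is 1 missing value out of 82 rows, or about 1.2%. If counts and percentages side by side would be easier to read, one possible sketch is below (the `demo` DataFrame and the `missing_summary` name are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in with one missing age and one missing education
demo = pd.DataFrame({
    "age": [32.0, None, 31.0, 38.0],
    "education": ["Bachelor's", "Bachelor's", "Bachelor's", None],
    "sex": ["Male", "Female", "Female", "Female"],
})

missing = demo.isnull()

missing_summary = pd.DataFrame({
    "n_missing": missing.sum(),                      # count per column
    "pct_missing": (missing.mean() * 100).round(1),  # percentage per column
}).sort_values("pct_missing", ascending=False)

print(missing_summary)
```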

Because there are so few missing values, we will simply drop any row that has a missing observation in any column:

In [10]:
# Drop missing values
complete_cases = full_emails.dropna()
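By default `dropna()` removes a row if any column is missing. When only certain columns matter for the analysis, the `subset` parameter limits the check to those columns. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical stand-in: row 1 is missing age, row 2 is missing education
demo = pd.DataFrame({
    "email": ["a1@xrea.com", "b2@xrea.com", "c3@xrea.com"],
    "age": [32.0, None, 31.0],
    "education": ["Bachelor's", "Bachelor's", None],
})

any_missing_dropped = demo.dropna()           # drop rows missing ANY value
age_required = demo.dropna(subset=["age"])    # drop rows only if age is missing

print(len(any_missing_dropped))
print(len(age_required))
```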

To confirm that all missing observations have been cleared from the DataFrame, we can use the `info()` method again.

In [11]:
complete_cases.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 80 entries, 0 to 99
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   email         80 non-null     object 
 1   cohort        80 non-null     int64  
 2   pre           80 non-null     int64  
 3   post          80 non-null     int64  
 4   age           80 non-null     float64
 5   sex           80 non-null     object 
 6   education     80 non-null     object 
 7   satisfaction  80 non-null     int64  
 8   study_hours   80 non-null     float64
dtypes: float64(2), int64(4), object(3)
memory usage: 6.2+ KB
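Notice that the output reports 80 entries labeled 0 to 99: filtering keeps each row's original index label, so the labels now have gaps. That is harmless here, but if consecutive labels are wanted, `reset_index(drop=True)` renumbers the rows. A small sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical stand-in: the row labeled 2 has a missing age
demo = pd.DataFrame({"age": [32.0, 33.0, None, 31.0, 38.0]})

kept = demo.dropna()                       # labels 0, 1, 3, 4 (gap at 2)
renumbered = kept.reset_index(drop=True)   # labels 0, 1, 2, 3

print(kept.index.tolist())
print(renumbered.index.tolist())
```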


To export the resulting DataFrame to a spreadsheet, we can use the DataFrame's `to_excel()` method. Passing `index=False` leaves the row labels out of the file:

In [12]:
# Write to Excel, omitting the index column
complete_cases.to_excel('contestants_cleaned.xlsx', index=False)