# Extract, Transform, Load

**Goal**: Use exploratory data analysis to design the experiment, then transform the data and load it back into the database.

Specific:

- Extract applicant information regarding admissions quiz completion.
- Design a research question, null hypothesis and alternative hypothesis for the experiment.
- Create functions for transforming applicant documents and loading them to a database.
- Build a Python class to streamline the experiment.

In [2]:
import random

import pandas as pd
from pymongo import MongoClient

### Extract

Aggregate Clients by Quiz Completion

In [3]:
complete = 3717
incomplete = 1308

In [4]:
total = complete + incomplete
prop_incomplete = incomplete/total
print(
    "Proportion of users who don't complete admissions quiz:", round(prop_incomplete, 2)
)

Proportion of users who don't complete admissions quiz: 0.26


#### Developing a Research Question

RQ: Does sending an email to no-quiz applicants increase their likelihood of completing the admissions quiz?

In [5]:
null_hypothesis = "No significant difference in quiz completion between the 2 groups"

alternate_hypothesis = "A significant difference in quiz completion between the 2 groups"

print("Null Hypothesis:", null_hypothesis)
print("Alternate Hypothesis:", alternate_hypothesis)

Null Hypothesis: No significant difference in quiz completion between the 2 groups
Alternate Hypothesis: A significant difference in quiz completion between the 2 groups
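
Stated in terms of completion rates, the two hypotheses might be written as below. The symbols $p_{\text{email}}$ and $p_{\text{no email}}$ are notation introduced here for illustration (they do not appear elsewhere in this notebook) and stand for the proportion of applicants who complete the quiz in the treatment and control groups.

$$
H_0:\; p_{\text{email}} = p_{\text{no email}}
\qquad
H_A:\; p_{\text{email}} \neq p_{\text{no email}}
$$

This is the two-sided form implied by the wording "a significant difference". The research question asks about an increase, which would correspond to the one-sided alternative $p_{\text{email}} > p_{\text{no email}}$.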


find_by_date Function

In [6]:
def find_by_date(collection, date_string):
    """Find records in a PyMongo Collection created on a given date.

    Parameters
    ----------
    collection : pymongo.collection.Collection
        Collection in which to search for documents.
    date_string : str
        Date to query. Format must be '%Y-%m-%d', e.g. '2022-06-28'.

    Returns
    -------
    observations : list
        Result of query. List of documents (dictionaries).
    """
    # Convert `date_string` to datetime object
    start = pd.to_datetime(date_string, format="%Y-%m-%d")
    # Offset `start` by 1 day
    end = start + pd.DateOffset(days=1)
    # Create PyMongo query for no-quiz applicants b/t `start` and `end`
    query = {"createdAt": {"$gte": start, "$lt": end}, "admissionsQuiz": "incomplete"}
    # Query collection, get result
    result = collection.find(query)
    # Convert `result` to list
    observations = list(result)
    return observations
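
To see what filter the function sends to MongoDB, here is a minimal sketch that rebuilds the same query using only the standard library. It swaps `pd.to_datetime` for `datetime.strptime` (an assumption made so the snippet runs without pandas), and `build_query` is a hypothetical helper name, not part of this notebook.

```python
from datetime import datetime, timedelta


def build_query(date_string):
    """Build the no-quiz filter for one calendar day (hypothetical helper)."""
    # Parse the date: midnight at the start of the requested day
    start = datetime.strptime(date_string, "%Y-%m-%d")
    # Midnight at the start of the following day
    end = start + timedelta(days=1)
    # Half-open interval [start, end): `$gte` includes start, `$lt` excludes end
    return {
        "createdAt": {"$gte": start, "$lt": end},
        "admissionsQuiz": "incomplete",
    }


query = build_query("2022-05-04")
print(query["createdAt"]["$gte"])  # 2022-05-04 00:00:00
print(query["createdAt"]["$lt"])   # 2022-05-05 00:00:00
```

Using `$lt` against the next day's midnight, rather than `$lte` against 23:59:59, avoids dropping documents created in the final second of the day.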

### Transform: Designing the Experiment

- This step involves manipulating the data extracted.

assign_to_groups Function

- Takes a list of new user documents as input and adds two keys to each document. The first key is "inExperiment", and its value is always True. The second key is "group", with half of the records in "email (treatment)" and the other half in "no email (control)".

In [7]:
def assign_to_groups(observations):
    """Randomly assigns observations to control and treatment groups.

    Parameters
    ----------
    observations : list or pymongo.cursor.Cursor
        List of users to assign to groups.

    Returns
    -------
    observations : list
        List of documents from `observations` with two additional keys:
        `inExperiment` and `group`.
    """

    # Copy into a list (so a Cursor also works), then shuffle
    observations = list(observations)
    random.seed(42)
    random.shuffle(observations)
    
    # Get index position of item at observations halfway point
    idx = len(observations) // 2

    # Assign first half of observations to control group
    for doc in observations[:idx]:
        doc["inExperiment"] = True
        doc["group"] = "no email (control)"

    # Assign second half of observations to treatment group
    for doc in observations[idx:]:
        doc["inExperiment"] = True
        doc["group"] = "email (treatment)"

    return observations
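
A quick way to check the split is to run the same logic on made-up documents. In the sketch below, `split_into_groups` is a hypothetical, self-contained stand-in for `assign_to_groups` that uses the `inExperiment` key named in the description, and the five documents are invented for illustration.

```python
import random
from collections import Counter


def split_into_groups(observations, seed=42):
    """Shuffle documents and label half control, half treatment (sketch)."""
    observations = list(observations)
    random.seed(seed)
    random.shuffle(observations)
    # Floor division: with an odd count, the extra document goes to treatment
    idx = len(observations) // 2
    for doc in observations[:idx]:
        doc["inExperiment"] = True
        doc["group"] = "no email (control)"
    for doc in observations[idx:]:
        doc["inExperiment"] = True
        doc["group"] = "email (treatment)"
    return observations


# Five invented applicant documents
docs = [{"_id": i} for i in range(5)]
assigned = split_into_groups(docs)

# Count how many documents landed in each group
counts = Counter(doc["group"] for doc in assigned)
print(counts["no email (control)"], counts["email (treatment)"])  # 2 3
```

Because the seed is fixed, the same input list always produces the same assignment, which makes the experiment reproducible but means the shuffle is not independent across runs.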

export_treatment_emails Function

- Takes a list of documents (like observations_assigned) as input, creates a DataFrame with the emails of all observations in the treatment group, and saves the DataFrame as a CSV file.

In [8]:
def export_treatment_emails(observations_assigned, directory="."):
    """Creates CSV file with email addresses of observations in treatment group.

    CSV file name will include today's date, e.g. `'2022-06-28_ab-test.csv'`.
    The file has an `'email'` column and a `'tag'` column where every row
    is 'ab-test'.

    Parameters
    ----------
    observations_assigned : list
        Observations with group assignment.
    directory : str, default='.'
        Location for saved CSV file.

    Returns
    -------
    None
    """
    # Put `observations_assigned` docs into DataFrame
    df = pd.DataFrame(observations_assigned)
    

    # Add `"tag"` column
    df["tag"] = "ab-test"

    # Create mask for treatment group only
    mask = df["group"] == "email (treatment)"

    
    # Create filename with date
    date_string = pd.Timestamp.now().strftime("%Y-%m-%d")
    filename = directory + "/" + date_string + "_ab-test.csv"

    # Save DataFrame to directory (email and tag only)
    df.loc[mask, ["email", "tag"]].to_csv(filename, index=False)
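
To preview which rows end up in the CSV, the filtering step can be sketched without pandas. This example swaps the DataFrame for the standard-library `csv` module and writes to an in-memory buffer instead of a file; both are assumptions made to keep the snippet self-contained, and the email addresses are invented.

```python
import csv
import io

# Hypothetical documents shaped like the output of `assign_to_groups`
observations_assigned = [
    {"email": "a@example.com", "group": "email (treatment)"},
    {"email": "b@example.com", "group": "no email (control)"},
    {"email": "c@example.com", "group": "email (treatment)"},
]

# Keep only the treatment group and tag each row
rows = [
    {"email": doc["email"], "tag": "ab-test"}
    for doc in observations_assigned
    if doc["group"] == "email (treatment)"
]

# Write to an in-memory buffer, just to inspect the result
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["email", "tag"], lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Only the two treatment-group addresses appear in the output, each with the `ab-test` tag. The control-group applicant is left out.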

### Load: Preparing the Data

In [None]:
updated_applicant = observations_assigned[0]
applicant_id = updated_applicant["_id"]

update_applicants Function

- Takes a list of documents (like observations_assigned) as input, updates the corresponding documents in a collection, and returns a dictionary with the results of the update.

In [None]:
def update_applicants(collection, observations_assigned):
    """Update applicant documents in collection.

    Parameters
    ----------
    collection : pymongo.collection.Collection
        Collection in which documents will be updated.

    observations_assigned : list
        Documents that will be used to update collection

    Returns
    -------
    transaction_result : dict
        Status of update operation, including number of documents
        matched and number of documents modified.
    """
    # Initialize counters
    n = 0
    n_modified = 0
    
    # Iterate through applicants, updating each document in `collection`
    for doc in observations_assigned:
        result = collection.update_one(filter={"_id": doc["_id"]}, update={"$set": doc})
        n += result.matched_count
        n_modified += result.modified_count
    
    # create results
    transaction_result = {"n": n, "nModified": n_modified}
    
    return transaction_result
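
The difference between `n` and `nModified` is easiest to see on a small example. The sketch below runs the same loop against `FakeCollection`, a hypothetical in-memory stand-in written for this illustration. It only mimics the `matched_count` and `modified_count` attributes that PyMongo's update result exposes, and it is not a substitute for a real collection.

```python
from types import SimpleNamespace


class FakeCollection:
    """In-memory stand-in for a PyMongo collection (hypothetical)."""

    def __init__(self, docs):
        self.docs = {doc["_id"]: dict(doc) for doc in docs}

    def update_one(self, filter, update):
        stored = self.docs.get(filter["_id"])
        if stored is None:
            # No document matched the filter
            return SimpleNamespace(matched_count=0, modified_count=0)
        before = dict(stored)
        stored.update(update["$set"])
        # A matched document counts as modified only if something changed
        return SimpleNamespace(matched_count=1, modified_count=int(stored != before))


def update_applicants(collection, observations_assigned):
    n = 0
    n_modified = 0
    for doc in observations_assigned:
        result = collection.update_one(
            filter={"_id": doc["_id"]}, update={"$set": doc}
        )
        n += result.matched_count
        n_modified += result.modified_count
    return {"n": n, "nModified": n_modified}


collection = FakeCollection(
    [
        {"_id": 1},
        {"_id": 2, "inExperiment": True, "group": "email (treatment)"},
    ]
)
assigned = [
    {"_id": 1, "inExperiment": True, "group": "no email (control)"},  # changes
    {"_id": 2, "inExperiment": True, "group": "email (treatment)"},   # no change
    {"_id": 3, "inExperiment": True, "group": "email (treatment)"},   # not stored
]
transaction_result = update_applicants(collection, assigned)
print(transaction_result)  # {'n': 2, 'nModified': 1}
```

Two documents match by `_id`, but only one actually changes, so re-running the update on already-assigned applicants should report `nModified` of 0.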

### Python Classes: Building the Repository

MongoRepository Class

In [None]:
class MongoRepository:
    """Repository class for interacting with MongoDB database.

    Parameters
    ----------
    client : `pymongo.MongoClient`
        By default, `MongoClient(host='localhost', port=27017)`.
    db : str
        By default, `'wqu-abtest'`.
    collection : str
        By default, `'ds-applicants'`.

    Attributes
    ----------
    collection : pymongo.collection.Collection
        All data will be extracted from and loaded to this collection.
    """

    # Task 7.2.14
    def __init__(self,
                 client=None,
                 db="wqu-abtest",
                 collection="ds-applicants"):
        # Create the default client lazily, not at class-definition time
        if client is None:
            client = MongoClient(host="localhost", port=27017)
        self.collection = client[db][collection]
        
    # Task 7.2.17
    def find_by_date(self, date_string):
        """Find records in a PyMongo Collection created on a given date.

        Parameters
        ----------
        date_string : str
            Date to query. Format must be '%Y-%m-%d', e.g. '2022-06-28'.

        Returns
        -------
        observations : list
            Result of query. List of documents (dictionaries).
        """
        # Convert `date_string` to datetime object
        start = pd.to_datetime(date_string, format='%Y-%m-%d')
        # Offset `start` by 1 day
        end = start + pd.DateOffset(days=1)
        # Create PyMongo query for no-quiz applicants b/t `start` and `end`
        query = {"createdAt": {"$gte": start, "$lt": end}, "admissionsQuiz": "incomplete"}
        # Query collection, get result
        result = self.collection.find(query)
        # Convert `result` to list
        observations = list(result)
        return observations

    # Task 7.2.18
    def update_applicants(self, observations_assigned):
        """Update applicant documents in collection.

        Parameters
        ----------
        observations_assigned : list
            Documents that will be used to update collection

        Returns
        -------
        transaction_result : dict
            Status of update operation, including number of documents
            matched and number of documents modified.
        """
        # Initialize counters
        n = 0
        n_modified = 0

        # Iterate through applicants
        for doc in observations_assigned:
            result = self.collection.update_one(filter={"_id": doc["_id"]}, update={"$set": doc})
            n += result.matched_count
            n_modified += result.modified_count

        # create results
        transaction_result = {"n": n, "nModified": n_modified}

        return transaction_result
    
    # Task 7.2.19
    def assign_to_groups(self, date_string):
        """Assign no-quiz applicants from a given date to groups and update them.

        Parameters
        ----------
        date_string : str
            Date to query. Format must be '%Y-%m-%d', e.g. '2022-06-28'.

        Returns
        -------
        result : dict
            Status of update operation, as returned by `update_applicants`.
        """
        # Get observations
        observations = self.find_by_date(date_string)
        
        # Shuffle `observations`
        random.seed(42)
        random.shuffle(observations)

        # Get index position of item at observations halfway point
        idx = len(observations) // 2

        # Assign first half of observations to control group
        for doc in observations[:idx]:
            doc["inExperiment"] = True
            doc["group"] = "no email (control)"

        # Assign second half of observations to treatment group
        for doc in observations[idx:]:
            doc["inExperiment"] = True
            doc["group"] = "email (treatment)"
        
        # Update collection
        result = self.update_applicants(observations)

        return result

@micahondiwa April 2023