<font size="+3"><strong>2. Extract, Transform, Load</strong></font>

In the last lesson, we focused on exploratory data analysis. Specifically, we extracted information from our MongoDB database in order to describe some characteristics of the DS Lab applicant pool — country of origin, age, and education level. In this lesson, our goal is to design our experiment, and that means we'll need to go beyond extracting information. We'll also need to make some transformations in our data and then load it back into our database.

In Data Science and Data Engineering, the process of taking data from a source, changing it, and then loading it into a database is called **ETL**, which is short for **extract, transform, load**. ETL tends to be more programming-intensive than other data science tasks like visualization, so we'll also spend time in this lesson exploring Python as an **object-oriented programming** language. Specifically, we'll create our own Python **class** to contain our ETL processes.

<div class="alert alert-block alert-warning">
<b>Warning:</b> The database has changed since this videos for this lesson were filmed. So don't worry if you don't get exactly the same numbers as the instructor for the tasks in this project.
</div>

> Note: Replace `ip_of_mongo_device` with your actual MongoDB IP address.



In [None]:
import random

import pandas as pd
from IPython.display import VimeoVideo
from pymongo import MongoClient
from teaching_tools.ab_test.reset import Reset

r = Reset("192.7.148.2")
r.reset_database()

In [None]:
VimeoVideo("742770800", h="ce17b05c51", width=600)


# Connect


<div style="padding: 1em; border: 1px solid #f0ad4e; border-left: 6px solid #f0ad4e; background-color: #fcf8e3; color: #8a6d3b; border-radius: 4px;">

<strong>🛠️ Instruction:</strong> Locate the IP address of the machine running MongoDB and assign it to the variable <code>host</code>. Make sure to use a <strong>string</strong> (i.e., wrap the IP in quotes).<br><br>

<strong>⚠️ Note:</strong> The IP address is <strong>dynamic</strong> — it may change every time you start the lab. Always check the current IP before proceeding.

</div>
<img src="images/mongo_ip.png" alt="MongoDB" width="600"/>


As usual, the first thing we're going to need to do is get access to our data. 

**Task 7.2.1:** Assign the `"ds-applicants"` collection in the `"wqu-abtest"` database to the variable name `ds_app`.
> Note: When using the `MongoClient` class, specify the host by passing `host=host` as the first argument. For example: `MongoClient(host=host, port=27017)`
> 
> MongoDB server is running at `host` on port `27017`.

- [What's a MongoDB collection?](../%40textbook/11-databases-mongodb.ipynb#Collections)
- [Access a collection in a database using PyMongo.](../%40textbook/11-databases-mongodb.ipynb#Collections)

In [55]:
client = MongoClient(host="192.7.148.2",port=27017)
db=client["wqu-abtest"]
ds_app = db["ds-applicants"]
print("client:", type(client))
print("ds_app:", type(ds_app))

client: <class 'pymongo.synchronous.mongo_client.MongoClient'>
ds_app: <class 'pymongo.synchronous.collection.Collection'>


# Extract: Developing the Hypothesis

Now that we've connected to the data, we need to pull out the information we need. One aspect of our applicant pool that we didn't explore in the last lesson is how many applicants actually complete the DS Lab admissions quiz.

In [None]:
VimeoVideo("734130688", h="637d2529dc", width=600)

**Task 7.2.2:** Use the [`aggregate`](https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html#pymongo.collection.Collection.aggregate) method to calculate the number of applicants that completed and did not complete the admissions quiz.

- [Perform aggregation calculations on documents using PyMongo.](../%40textbook/11-databases-mongodb.ipynb#Aggregation)

In [56]:
result=ds_app.aggregate(
    [
        {
              "$group":{
                  "_id":"$admissionsQuiz","count":{"$count":{}}
              }
        }
    ]
)

In [None]:
list(result)

In [57]:
# How many applicants complete admissions quiz?

result=ds_app.aggregate(
    [
        {
              "$group":{
                  "_id":"$admissionsQuiz","count":{"$count":{}}
              }
        }
    ]
)
for r in result:
    if r["_id"]=="incomplete":
        incomplete=r["count"]
    else:
        complete=r["count"]



print("Completed quiz:", complete)
print("Did not complete quiz:", incomplete)

Completed quiz: 3717
Did not complete quiz: 1308


That gives us some raw numbers, but we're interested in participation *rates*, not participation numbers. Let's turn what we just got into a percentage.

In [None]:
VimeoVideo("734130558", h="b06dabae44", width=600)

**Task 7.2.3:** Using your results from the previous task, calculate the proportion of new users who have not completed the admissions quiz.

- [Perform basic mathematical operations in Python.](../%40textbook/01-python-getting-started.ipynb#Simple-Calculations)

In [58]:
total=complete+incomplete
prop_incomplete = incomplete/total
print(
    "Proportion of users who don't complete admissions quiz:", round(prop_incomplete, 2)
)

Proportion of users who don't complete admissions quiz: 0.26


Now that we know that around a quarter of DS Lab applicants don't complete the admissions quiz, is there anything we can do improve the completion rate? 

This is a question that we asked ourselves at WQU. In fact, here's a conversation between Nicholas and Anne (Program Director at WQU) where they identify the issue, come up with a hypothesis, and then decide how they'll conduct their experiment.

A **hypothesis** is an informed guess about what we think is going to happen in an experiment. We probably hope that whatever we're trying out is going to work, but it's important to maintain a healthy degree of skepticism. Science experiments are designed to demonstrate what *does* work, not what doesn't, so we always start out by assuming that whatever we're about to do won't make a difference (even if we hope it will). The idea that an experimental intervention won't change anything is called a **null hypothesis** ($H_0$), and every experiment either rejects the null hypothesis (meaning the intervention worked), or fails to reject the null hypothesis (meaning it didn't). 

The mirror image of the null hypothesis is called an **alternate hypothesis** ($H_a$), and it proceeds from the idea that whatever we're about to do actually *will* work. If I'm trying to figure out whether exercising is going to help me lose weight, the null hypothesis says that if I exercise, I won't lose any weight. The alternate hypothesis says that if I exercise, I will lose weight. 

It's important to keep both types of hypothesis in mind as you work through your experimental design.

In [None]:
VimeoVideo("734130136", h="e1c88a9ecd", width=600)

In [None]:
VimeoVideo("734131639", h="7e9aac1e60", width=600)

**Task 7.2.4:** Based on the discussion between Nicholas and Anne, write a null and alternate hypothesis to test in the next lesson.

- [What's a null hypothesis?](../%40textbook/20-statistics.ipynb#Hypothesis-Testing)
- [What's an alternate hypothesis?](../%40textbook/20-statistics.ipynb#Hypothesis-Testing)

In [None]:
null_hypothesis = ...

alternate_hypothesis =...

print("Null Hypothesis:", null_hypothesis)
print("Alternate Hypothesis:", alternate_hypothesis)

The next thing we need to do is figure out a way to filter the data so that we're only looking at students who applied on a certain date. This is a perfect chance to write a function!

In [None]:
VimeoVideo("734136019", h="227630f2d2", width=600)

**Task 7.2.5:** Create a function `find_by_date` that can search a collection such as `"ds-applicants"` and return all the no-quiz applicants from a specific date. Use the docstring below for guidance.

- Convert data to `datetime` using pandas.
- Perform a date offset using pandas. 
- [Select date ranges using the `$gt`, `$gte`, `$lt`, and `$lte` operators in PyMongo.](../%40textbook/11-databases-mongodb.ipynb#Analyzing-Data)
- [Query a collection using PyMongo](../%40textbook/11-databases-mongodb.ipynb#Exploring-a-Database)

In [59]:
def find_by_date(collection, date_string):
    """Find records in a PyMongo Collection created on a given date.

    Parameters
    ----------
    collection : pymongo.collection.Collection
        Collection in which to search for documents.
    date_string : str
        Date to query. Format must be '%Y-%m-%d', e.g. '2022-06-28'.

    Returns
    -------
    observations : list
        Result of query. List of documents (dictionaries).
    """
    collection=ds_app
    date_string="2022-05-04"
    # Convert `date_string` to datetime object
    start = pd.to_datetime(date_string,format="%Y-%m-%d")
    # Offset `start` by 1 day
    end = start+pd.DateOffset(days=1)
    # Create PyMongo query for no-quiz applicants b/t `start` and `end`
    query = {"createdAt": {"$gte": start, "$lt": end}, "admissionsQuiz": "incomplete"}
    # Query collection, get result
    result = collection.find(query)
    # Convert `result` to list
    observations = list(result)
    return observations

2 May 2022 seems like as good a date as any, so let's use the function we just wrote to get all the students who applied that day.

In [None]:
VimeoVideo("734135947", h="172e5d7e19", width=600)

**Task 7.2.6:** Use your `find_by_date` function to create a list `observations` with all the new users created on **2 May 2022**.

- [What's a function?](../%40textbook/02-python-advanced.ipynb#Functions)

In [60]:
observations = find_by_date(ds_app,date_string="2022-05-05")

print("observations type:", type(observations))
print("observations len:", len(observations))
observations[0]

observations type: <class 'list'>
observations len: 48


{'_id': ObjectId('68a811f5eb54637f38f2cbfa'),
 'createdAt': datetime.datetime(2022, 5, 4, 1, 4),
 'firstName': 'Lindsay',
 'lastName': 'Schwartz',
 'email': 'lindsay.schwartz9@hotmeal.com',
 'birthday': datetime.datetime(1998, 5, 26, 0, 0),
 'gender': 'female',
 'highestDegreeEarned': "Bachelor's degree",
 'countryISO2': 'NG',
 'admissionsQuiz': 'incomplete',
 'group': 'email (control)',
 'inExperiment': True}

# Transform: Designing the Experiment

Okay! Now that we've extracted the data we'll need for the experiment, it's time to get our hands dirty. 

The **transform** stage of ETL involves manipulating the data we just extracted. In this case, we're going to be figuring out which students didn't take the quiz, and assigning them to different experimental groups. To do that, we'll need to *transform* each document in the database by creating a new attribute for each record.

Now we can split the students who didn't take the quiz into two groups: one that will receive a reminder email, and one that will not. Let's make another function that'll do that for us.

In [None]:
VimeoVideo("734134939", h="d7b409da4b", width=600)

**Task 7.2.7:** Create a function `assign_to_groups` that takes a list of new user documents as input and adds two keys to each document. The first key should be `"inExperiment"`, and its value should always be `True`. The second key should be `"group"`, with half of the records in `"email (treatment)"` and the other half in `"no email (control)"`.

- [Write a function in Python.](../%40textbook/02-python-advanced.ipynb#Functions)

In [61]:
def assign_to_groups(observations):
    """Randomly assigns observations to control and treatment groups.

    Parameters
    ----------
    observations : list or pymongo.cursor.Cursor
        List of users to assign to groups.

    Returns
    -------
    observations : list
        List of documents from `observations` with two additional keys:
        `inExperiment` and `group`.
    """
    # Shuffle `observations`
    random.seed(42)
    random.shuffle(observations)

    # Get index position of item at observations halfway point
    idx = len(observations)//2

    # Assign first half of observations to control group
  
    for doc in observations[:idx]:
        doc["inExperiment"]=True
        doc["group"]="no email (control)"
    # Assign second half of observations to treatment group
    for doc in observations[idx:]:
        doc["inExperiment"]=True
        doc["group"]="email (control)"

    return observations


observations_assigned = assign_to_groups(observations)

print("observations_assigned type:", type(observations_assigned))
print("observations_assigned len:", len(observations_assigned))
observations_assigned[0]

observations_assigned type: <class 'list'>
observations_assigned len: 48


{'_id': ObjectId('68a811f5eb54637f38f2d89a'),
 'createdAt': datetime.datetime(2022, 5, 4, 15, 59, 4),
 'firstName': 'Fernando',
 'lastName': 'Burns',
 'email': 'fernando.burns55@microsift.com',
 'birthday': datetime.datetime(1999, 1, 25, 0, 0),
 'gender': 'male',
 'highestDegreeEarned': 'Some College (1-3 years)',
 'countryISO2': 'NP',
 'admissionsQuiz': 'incomplete',
 'group': 'no email (control)',
 'inExperiment': True}

In the video, Anne said that she needs a CSV file with applicant email addresses. Let's automate that process with another function.

In [None]:
VimeoVideo("734137698", h="87610a6a1c", width=600)



**Task 7.2.8:** Create a function `export_email` that takes a list of documents (like `observations_assigned`) as input, creates a DataFrame with the emails of all observations in the treatment group, and saves the DataFrame as a CSV file. Then use your function to create a CSV file in the current directory.

- [Write a function in Python.](../%40textbook/02-python-advanced.ipynb#Functions)
- [Create a DataFrame from a Series in pandas.](../%40textbook/03-pandas-getting-started.ipynb#Working-with-DataFrames)
- [Save a DataFrame as a CSV file using pandas.](../%40textbook/03-pandas-getting-started.ipynb#Saving-a-DataFrame-as-a-CSV)

In [62]:
def export_treatment_emails(observations_assigned, directory="."):
    """Creates CSV file with email addresses of observations in treatment group.

    CSV file name will include today's date, e.g. `'2022-06-28_ab-test.csv'`,
    and a `'tag'` column where every row will be 'ab-test'.

    Parameters
    ----------
    observations_assigned : list
        Observations with group assignment.
    directory : str, default='.'
        Location for saved CSV file.

    Returns
    -------
    None
    """
    # Put `observations_assigned` docs into DataFrame
    df=pd.DataFrame(observations_assigned)
    # Add `"tag"` column
    df["tag"]="ab-test"
    # Create mask for treatment group only
    mask=df["group"]=="email (control)"
    # Create filename with date
    date_string=pd.Timestamp.now().strftime(format="%Y-%m-%d")
    filename = directory+"/"+date_string+"_ab-test.csv"

    # Save DataFrame to directory (email and tag only)
    df[mask][["email","tag"]].to_csv(filename,index=False)

export_treatment_emails(observations_assigned=observations_assigned)

# Load: Preparing the Data

We've *extracted* the data and written a bunch of functions we can use to *transform* the data, so it's time for the third part of this module: *loading* the data.

We've assigned the no-quiz applicants to groups for our experiment, so we should update the records in the `"ds-applicants"` collection to reflect that assignment. Before we update all our records, let's start with just one. 

In [None]:
VimeoVideo("734137546", h="e07cebf91e", width=600)

**Task 7.2.9:** Assign the first item in `observations_assigned` list to the variable `updated_applicant`. The assign that applicant's ID to the variable `applicant_id`.

- [What's a dictionary?](../%40textbook/01-python-getting-started.ipynb#Working-with-Dictionaries)
- [Access an item in a dictionary using Python.](../%40textbook/01-python-getting-started.ipynb#Working-with-Dictionaries)

<div class="alert alert-info" role="alert">
    <p><b>Note:</b> The data in the database may have been updated since this video was recorded, so don't worry if you get a student other than "Raymond Brown".</p>
</div>

In [63]:
updated_applicant =observations_assigned[0]
applicant_id = updated_applicant["_id"]
print("applicant type:", type(updated_applicant))
print(updated_applicant)
print()
print("applicant_id type:", type(applicant_id))
print(applicant_id)

applicant type: <class 'dict'>
{'_id': ObjectId('68a811f5eb54637f38f2d89a'), 'createdAt': datetime.datetime(2022, 5, 4, 15, 59, 4), 'firstName': 'Fernando', 'lastName': 'Burns', 'email': 'fernando.burns55@microsift.com', 'birthday': datetime.datetime(1999, 1, 25, 0, 0), 'gender': 'male', 'highestDegreeEarned': 'Some College (1-3 years)', 'countryISO2': 'NP', 'admissionsQuiz': 'incomplete', 'group': 'no email (control)', 'inExperiment': True}

applicant_id type: <class 'bson.objectid.ObjectId'>
68a811f5eb54637f38f2d89a


Now that we have the unique identifier for one of the applicants, we can find it in the collection.

In [None]:
VimeoVideo("734137409", h="5ea2eaf949", width=600)

**Task 7.2.10:** Use the `find_one` method together with the `applicant_id` from the previous task to locate the original record in the `"ds-applicants"` collection.

- Access a class method in Python.

In [64]:
# Find original record for `applicant_id`
ds_app.find_one({"_id":applicant_id})

{'_id': ObjectId('68a811f5eb54637f38f2d89a'),
 'createdAt': datetime.datetime(2022, 5, 4, 15, 59, 4),
 'firstName': 'Fernando',
 'lastName': 'Burns',
 'email': 'fernando.burns55@microsift.com',
 'birthday': datetime.datetime(1999, 1, 25, 0, 0),
 'gender': 'male',
 'highestDegreeEarned': 'Some College (1-3 years)',
 'countryISO2': 'NP',
 'admissionsQuiz': 'incomplete',
 'group': 'no email (control)',
 'inExperiment': True}

And now we can update that document to show which group that applicant belongs to.

In [None]:
VimeoVideo("734141207", h="afe52c4d42", width=600)

**Task 7.2.11:** Use the `update_one` method to update the record with the new information in `updated_applicant`. Once you're done, rerun your query from the previous task to see if the record has been updated. 

- [Update one or more records in PyMongo.](../%40textbook/11-databases-mongodb.ipynb#Analyzing-Data)

In [65]:
result =ds_app.update_one(
    filter={"_id":applicant_id},
    update={"$set":updated_applicant}
)
print("result type:", type(result))

result type: <class 'pymongo.results.UpdateResult'>


Note that when we update the document, we get a `result` back. Before we update multiple records, let's take a moment to explore what `result` is — and how it relates to object oriented programming in Python.

In [None]:
VimeoVideo("734142198", h="eabd16f09e", width=600)

**Task 7.2.12:** Use the [`dir`](https://docs.python.org/3/library/functions.html#dir) function to inspect `result`. Once you see some of the attributes, try to access them. For instance, what does the `raw_result` attribute tell you about the success of your record update?

- [What's a class?](../%40textbook/21-python-object-oriented-programming.ipynb#Classes)
- [What's a class attribute?](../%40textbook/21-python-object-oriented-programming.ipynb#Attributes)
- [Access a class attribute in Python.](../%40textbook/21-python-object-oriented-programming.ipynb#Methods)

In [66]:
# Access methods and attributes using `dir`
dir(result)
# Access `raw_result` attribute
result.raw_result

{'n': 1, 'nModified': 0, 'ok': 1.0, 'updatedExisting': True}

We know how to update a record, and we can interpret our operation results. Since we can do it for one record, we can do it for all of them! So let's update the records for all the observations in our experiment. 

In [None]:
VimeoVideo("734147474", h="4e38b07a71", width=600)

**Task 7.2.13:** Create a function `update_applicants` that takes a list of document like as input, updates the corresponding documents in a collection, and returns a dictionary with the results of the update. Then use your function to update `"ds-applicants"` with `observations_assigned`.

- [Write a function in Python.](../%40textbook/02-python-advanced.ipynb#Functions)
- [Write a `for` loop in Python.](../%40textbook/01-python-getting-started.ipynb#Working-with-for-Loops)

In [67]:
def update_applicants(collection, observations_assigned):
    """Update applicant documents in collection.

    Parameters
    ----------
    collection : pymongo.collection.Collection
        Collection in which documents will be updated.

    observations_assigned : list
        Documents that will be used to update collection

    Returns
    -------
    transaction_result : dict
        Status of update operation, including number of documents
        and number of documents modified.
    """
    n=0
    n_modified=0
    for doc in observations_assigned:
        result =collection.update_one(
            filter={"_id":doc["_id"]},
            update={"$set":doc}
        )
        n += result.matched_count
        n_modified+=result.modified_count
    transaction_result=f"n:{n},modified:{n_modified}"    
    return transaction_result

In [68]:
result = update_applicants(ds_app, observations_assigned)
print("result type:", type(result))
result

result type: <class 'str'>


'n:48,modified:0'

Note that if you run the above cell multiple times, the value for `result["nModified"]` will go to `0`. This is because you've already updated the documents. 

# Python Classes: Building the Repository

We've managed to extract data from our database using our `find_by_date` function, transform it using our `assign_to_groups` function, and load it using our `update_applicants` function. Does that mean we're done? Not yet! There's an issue we need to address: distraction.

What do we mean when we say distraction? Think about it this way: Do you need to know the exact code that makes `df.describe()` work? No, you just need to calculate summary statistics! Going into more details would distract you from the work you need to get done. The same is true of the tools you've created in this lesson. Others will want to use them in future experiments with worrying about your implementation. The solution is to **abstract** the details of your code away.

To do this we're going to create a [Python class.](https://docs.python.org/3/tutorial/classes.html) Python classes contain both information and ways to interact with that information. An example of class is a pandas `DataFrame`. Not only does it hold data (like the size of an apartment in Buenos Aires or the income of a household in the United States); it also provides methods for inspecting it (like `DataFrame.head()` or `DataFrame.info()`) and manipulating it (like `DataFrame.sum()` or `DataFrame.replace()`). 

In the case of this project, we want to create a class that will hold information about the documents we want (like the name and location of the collection) and provide tools for interacting with those documents (like the functions we've built above). Let's get started!

In [None]:
VimeoVideo("734133492", h="a0f97831a1", width=600)

In [None]:
VimeoVideo("734133039", h="070a04dd1c", width=600)

**Task 7.2.14:** Define a `MongoRepository` class with an `__init__` method. The `__init__` method should accept three arguments: `client`, `db`, and `collection`. Use the docstring below as a guide.

> Note: When initializing `MongoClient`, make sure to pass the `host` parameter like this: `host=host`.
- [Write a class definition in Python.](../%40textbook/21-python-object-oriented-programming.ipynb#Classes)
- [Write a class method in Python.](../%40textbook/21-python-object-oriented-programming.ipynb#Methods) 

In [73]:
class MongoRepository:
    """Repository class for interacting with MongoDB database.

    Parameters
    ----------
    client : `pymongo.MongoClient`
        By default, `MongoClient(host=host, port=27017)`.
    db : str
        By default, `'wqu-abtest'`.
    collection : str
        By default, `'ds-applicants'`.

    Attributes
    ----------
    collection : pymongo.collection.Collection
        All data will be extracted from and loaded to this collection.
    """
    def __init__(
        self,
        client=MongoClient(host="192.7.148.2",port=27017),
        db="wqu-abtest",
        collection="ds-applicants"
    ):
        self.collection=client[db][collection]

    # Task 7.2.14
    def find_by_date(self,date_string):
    
    
    # Convert `date_string` to datetime object
        start = pd.to_datetime(date_string,format="%Y-%m-%d")
    # Offset `start` by 1 day
        end = start+pd.DateOffset(days=1)
    # Create PyMongo query for no-quiz applicants b/t `start` and `end`
        query = {"createdAt": {"$gte": start, "$lt": end}, "admissionsQuiz": "incomplete"}
    # Query collection, get result
        result = self. collection.find(query)
    # Convert `result` to list
        observations = list(result)
        return observations

    # Task 7.2.17
    def update_applicants(self, observations_assigned):
    
        n=0
        n_modified=0
        for doc in observations_assigned:
            result =self.collection.update_one(
                filter={"_id":doc["_id"]},
                update={"$set":doc}
            )
            n += result.matched_count
            n_modified+=result.modified_count
        transaction_result=f"n:{n},modified:{n_modified}"    
        return transaction_result
    # Task 7.2.18
    def assign_to_groups(self,date_string):
        observations=self.find_by_date(date_string)
        # Shuffle `observations`
        random.seed(42)
        random.shuffle(observations)
    
        # Get index position of item at observations halfway point
        idx = len(observations)//2
    
        # Assign first half of observations to control group
      
        for doc in observations[:idx]:
            doc["inExperiment"]=True
            doc["group"]="no email (control)"
        # Assign second half of observations to treatment group
        for doc in observations[idx:]:
            doc["inExperiment"]=True
            doc["group"]="email (treatment)"
        result=self.update_applicants(observations)    
    
        return result
    
    
    
    
    # Task 7.2.19
   

Now that we have a class definition, we can do all sorts of interesting things. The first thing to do is instantiate the class...

In [None]:
VimeoVideo("734150578", h="2caaa53d03", width=600)

**Task 7.2.15:** Create an instance of your `MongoRepository` and assign it to the variable name `repo`.

In [74]:
repo = MongoRepository()
print("repo type:", type(repo))
repo

repo type: <class '__main__.MongoRepository'>


<__main__.MongoRepository at 0x7d06d48ee110>

...and then we can look at the attributes of the collection.

In [None]:
VimeoVideo("734150427", h="f9c9433ff6", width=600)

**Task 7.2.16:** Extract the `collection` attribute from `repo`, and assign it to the variable `c_test`. Is the `c_test` the correct data type?

- [Access a class attribute in Python.](../%40textbook/21-python-object-oriented-programming.ipynb#Methods)

In [75]:
c_test = repo.collection
print("c_test type:", type(c_test))
c_test

c_test type: <class 'pymongo.synchronous.collection.Collection'>


Collection(Database(MongoClient(host=['192.7.148.2:27017'], document_class=dict, tz_aware=False, connect=True), 'wqu-abtest'), 'ds-applicants')

Our class is built, and now we need to take the ETL functions we created and turn them into **class methods**. Think back to the beginning of the course, where we learned how to work with DataFrames. If we call a DataFrame `df`, we can use methods designed by other people to figure out what's inside. We've learned lots of those methods already — `df.head()` `df.info()`, etc. — but we can also create our own. Let's give it a try.

In [None]:
VimeoVideo("734150075", h="82f7810cd0", width=600)

**Task 7.2.17:** Using your function as a model, create a `find_by_date` method for your `MongoRepository` class. It should take only one argument: `date_string`. Once you're done, test your method by extracting all the users who created account on 15 May 2022.

- [Access a class method in Python.](../%40textbook/21-python-object-oriented-programming.ipynb#Methods)

In [76]:
may_15_users = repo.find_by_date(date_string="2022-05-15")
print("may_15_users type", type(may_15_users))
print("may_15_users len", len(may_15_users))
may_15_users[:3]

may_15_users type <class 'list'>
may_15_users len 30


[{'_id': ObjectId('68a811f5eb54637f38f2cc0d'),
  'createdAt': datetime.datetime(2022, 5, 15, 20, 21, 12),
  'firstName': 'Patrick',
  'lastName': 'Derosa',
  'email': 'patrick.derosa81@hotmeal.com',
  'birthday': datetime.datetime(2000, 9, 30, 0, 0),
  'gender': 'male',
  'highestDegreeEarned': "Bachelor's degree",
  'countryISO2': 'UA',
  'admissionsQuiz': 'incomplete'},
 {'_id': ObjectId('68a811f5eb54637f38f2cd2c'),
  'createdAt': datetime.datetime(2022, 5, 15, 10, 50, 56),
  'firstName': 'Deidre',
  'lastName': 'Pagan',
  'email': 'deidre.pagan75@hotmeal.com',
  'birthday': datetime.datetime(1996, 12, 2, 0, 0),
  'gender': 'female',
  'highestDegreeEarned': "Bachelor's degree",
  'countryISO2': 'ZW',
  'admissionsQuiz': 'incomplete'},
 {'_id': ObjectId('68a811f5eb54637f38f2cdbf'),
  'createdAt': datetime.datetime(2022, 5, 15, 5, 8, 35),
  'firstName': 'Harry',
  'lastName': 'Ellis',
  'email': 'harry.ellis78@microsift.com',
  'birthday': datetime.datetime(2000, 2, 6, 0, 0),
  'gende

Good work! Let's try it again!

In [None]:
VimeoVideo("734149871", h="4db7c08002", width=600)

**Task 7.2.18:** Using your function as a model, create an `update_applicants` method for your `MongoRepository` class. It should take one argument: `documents`. To test your method, use the function to update the documents in `observations_assigned`. 

- [Access a class method in Python.](../%40textbook/21-python-object-oriented-programming.ipynb#Methods)

In [77]:
result = repo.update_applicants(observations_assigned)
print("result type:", type(result))
result

result type: <class 'str'>


'n:48,modified:0'

Let's make another one!<span style='color: transparent; font-size:1%'>WQU WorldQuant University Applied Data Science Lab QQQQ</span>

In [None]:
VimeoVideo("734149186", h="65f443159c", width=600)

**Task 7.2.19:** Create an `assign_to_groups` method for your `MongoRepository` class. Note that it should work differently than your original function. It will take one argument: `date_string`. It should find users from that date, assign them to groups, update the database, and return the results of the transaction. Once you're done, use your method to assign all the users who created account on **14 May 2022**, to groups.

- [Access a class method in Python.](../%40textbook/21-python-object-oriented-programming.ipynb#Methods)

In [78]:
result =repo.assign_to_groups(date_string="2022-05-15")
print("result type:", type(result))
result

result type: <class 'str'>


'n:30,modified:30'

We'll leave it to you to implement an `export_treatment_emails` method. For now, let's submit your class to the grader. 

In [83]:
VimeoVideo("734148753", h="2305068b6b", width=600)

**Task 7.2.20:** Run the cell below, to create a new instance of your `MongoRepository` class, assign users from **16 May 2022** to groups, and submit the results to the grader.

In [90]:
repo_test = MongoRepository()
submission = repo_test.assign_to_groups("2022-05-01")
submission

'n:37,modified:0'

In [92]:
repo_test = MongoRepository()
submission = repo_test.assign_to_groups("2022-05-16")

---
Copyright 2023 WorldQuant University. This
content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
