## Retrieving and cleaning data


This notebook exports the data from your experiment to a CSV file. First import
the SQL file you received into the phpMyAdmin tool (see Canvas for instructions
on how to do this), then run the code below. Watch out! Some cells need small
adaptations; these are indicated in the notebook.


In [None]:
!pip install sqlalchemy pymysql pandas

In [150]:
import pandas as pd
from sqlalchemy import create_engine, inspect

The following code connects to the SQL database that you imported with
phpMyAdmin and gets the names of all the tables it contains. Each group should
have several tables such as "calibration", "userconsent" or "propositions", but
some tables might only be present for some of the groups.

**!! IMPORTANT !! You need to change the first command so that it uses the name
you gave the database. In this example the database is called "group25";
replace this with the name your database has in phpMyAdmin.**


In [None]:
# Create a connection to the MySQL database using SQLAlchemy
engine = create_engine('mysql+pymysql://root:@localhost/group25')

# Use the inspect module to get table names
inspector = inspect(engine)
tables = inspector.get_table_names()

# Print the list of all tables
print(tables)

In the next code cell we select all the tables whose names start with "user",
except for the "users" table (it does not contain any information relevant to
us). This gives us a list of all the tables we want to retrieve data from
later. Because the "user" tables differ from group to group, selecting them by
prefix makes the code work for every group.


In [152]:
# Select all tables whose names start with 'user', except 'users'
usercols = [n for n in tables if n.startswith('user') and n != 'users']
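
To see what this filter does, here is a minimal sketch on a made-up list of
table names. The names in `example_tables` are assumptions for illustration
only; your own database will probably contain different ones.

```python
# Hypothetical table names, for illustration only
example_tables = ['calibration', 'propositions', 'userconsent',
                  'userpresatisfaction', 'users', 'usersatisfaction']

# Same rule as above: keep names starting with 'user', but drop 'users'
example_usercols = [n for n in example_tables
                    if n.startswith('user') and n != 'users']
print(example_usercols)
```

Only the three tables that start with "user" and are not called "users" remain
in the list.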

Since we also want to know which statements the different Likert scale items
correspond to (important for the later analysis), we can retrieve a kind of
"codebook" from the database. The information about the Likert scale items can
be stored in a few different tables: pre_propositions (asked before the
calibration/recommendation), propositions (asked after the recommendation), and
sometimes propositions1 if there were more questions than fit on the pages. Your
group might have one, two, or all of these tables. Running the code below
prints the saved (pre-)propositions with their id, question wording, and scale
(usually 5 or 7 point). You can use the id to look up the matching column in the
dataframe that we save as a CSV file later.

With the code below we retrieve these tables, load them into pandas dataframes
and add a prefix to the item ids (pre_ or post_). The ids are often the same
for pre_propositions and propositions, so the prefix lets us keep them apart
later.


In [None]:
if 'pre_propositions' in tables:
    # Query to select all rows from the pre_propositions table
    query = "SELECT * FROM pre_propositions"

    # Load data into a pandas DataFrame
    pre_propositions = pd.read_sql(query, engine)
    prefix = 'pre_'
    string_columns = ['id']

    for col in string_columns:
        pre_propositions[col] = pre_propositions[col].apply(lambda x: f"{prefix}{x}")

    # Print the DataFrame
    print(pre_propositions)

In [None]:
if 'propositions' in tables:
    # Query to select all rows from the propositions table
    query = "SELECT * FROM propositions"

    # Load data into a pandas DataFrame
    propositions = pd.read_sql(query, engine)
    prefix = 'post_'
    string_columns = ['id']

    for col in string_columns:
        propositions[col] = propositions[col].apply(lambda x: f"{prefix}{x}")

    # Print the DataFrame
    print(propositions)

In [None]:
if 'propositions1' in tables:
    # Query to select all rows from the propositions1 table
    query = "SELECT * FROM propositions1"

    # Load data into a pandas DataFrame
    propositions1 = pd.read_sql(query, engine)
    prefix = 'post_'
    string_columns = ['id']

    for col in string_columns:
        propositions1[col] = propositions1[col].apply(lambda x: f"{prefix}{x}")

    # Print the DataFrame
    print(propositions1)
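
The prefixing step and the later lookup can be tried out on a small made-up
codebook. This is only a sketch: the column names `question` and `scale` and the
example rows are assumptions, so check the output of the cells above for the
real column names in your tables.

```python
import pandas as pd

# Made-up codebook; the 'question' and 'scale' column names are assumptions
codebook = pd.DataFrame({
    'id': [1, 2],
    'question': ['I liked the recommendations.', 'The system was easy to use.'],
    'scale': [5, 5],
})

# Prefix the ids in the same way as the cells above
codebook['id'] = codebook['id'].apply(lambda x: f"post_{x}")

# Look up the wording that belongs to the data column 'post_2'
wording = codebook.loc[codebook['id'] == 'post_2', 'question'].iloc[0]
print(wording)
```

The same kind of lookup should work on your real codebook once you replace
`question` with the column that holds the wording in your table.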

In the next cell we define a helper function that adds a prefix to all column
names except "userId". We use it below to keep the items asked before and after
the recommendation apart.


In [156]:
# Function to add prefix to all columns except 'userId'
def add_prefix_except_userid(df, prefix):
    return df.rename(columns={col: f"{prefix}{col}" if col != "userId" else col for col in df.columns})
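
A quick check of this helper on a toy dataframe (the data are made up, and the
function is repeated so that the example runs on its own). Note that `rename`
returns a new dataframe and leaves the original unchanged, so the result has to
be assigned to a variable.

```python
import pandas as pd

def add_prefix_except_userid(df, prefix):
    return df.rename(columns={col: f"{prefix}{col}" if col != "userId" else col
                              for col in df.columns})

# Toy data: two users, two question columns named 1 and 2
toy = pd.DataFrame({'userId': [1, 2], 1: [4, 5], 2: [3, 2]})

renamed = add_prefix_except_userid(toy, 'pre_')
print(list(renamed.columns))  # prefixed copy
print(list(toy.columns))      # original is unchanged
```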

In the following cell we finally retrieve our data: we take the names of all
the tables we identified as relevant (usercols), retrieve their content, and put
them all in a list. Two types of tables need some more work first:
userpresatisfaction(1) and usersatisfaction(1). These contain the Likert scale
items and are stored in so-called long format, meaning that every person
appears in as many rows as there are items in the Likert scale (so with a
14-item scale there are 14 rows per study participant). As the long format is
not very convenient for further analysis, we convert these tables into wide
format (see the .pivot command below) so that all the tables have one row per
participant.


In [157]:
dfs = []
for col in usercols:
    query = f"SELECT * FROM {col}"
    tablecontent = pd.read_sql(query, engine)
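    if col in ['userpresatisfaction', 'userpresatisfaction1']:
        # Long to wide: one row per user, one column per questionId
        tablecontent = tablecontent.pivot(index='userId', columns='questionId', values='value').reset_index()
        tablecontent.columns.name = None
        # The helper returns a new DataFrame, so the result must be assigned
        tablecontent = add_prefix_except_userid(tablecontent, 'pre_')
    elif col in ['usersatisfaction', 'usersatisfaction1']:
        # Long to wide: one row per user, one column per questionId
        tablecontent = tablecontent.pivot(index='userId', columns='questionId', values='value').reset_index()
        tablecontent.columns.name = None
        # The helper returns a new DataFrame, so the result must be assigned
        tablecontent = add_prefix_except_userid(tablecontent, 'post_')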
    dfs.append(tablecontent)
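
The long-to-wide conversion is easier to follow on a tiny made-up example. The
sketch below assumes the same three column names that the code above uses
(`userId`, `questionId`, `value`); the numbers are invented.

```python
import pandas as pd

# Long format: one row per user AND question
long_df = pd.DataFrame({
    'userId':     [1, 1, 2, 2],
    'questionId': [1, 2, 1, 2],
    'value':      [4, 5, 2, 3],
})

# Wide format: one row per user, one column per question
wide_df = long_df.pivot(index='userId', columns='questionId', values='value').reset_index()
wide_df.columns.name = None
print(wide_df)
```

The four rows of the long table become two rows, one per user. If a user has
more than one row for the same question, `.pivot` raises an error, so duplicate
answers would need to be removed first.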

Now we merge all the tables we collected in the list on userId, so that we have
one row per user and one column per variable. To clean the dataframe further, we
also remove some variables that are not of interest to us: all columns whose
names start with "time".


In [158]:
# Merge all DataFrames on the userId column
merged_df = dfs[0]
for df in dfs[1:]:
    merged_df = pd.merge(merged_df, df, on='userId', how="outer")
# Drop all columns whose names start with 'time'
merged_df = merged_df.loc[:, ~merged_df.columns.str.startswith('time')]
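
What the outer merge and the column filter do can be seen on two small made-up
tables. The column names `consent` and `timeSpent` are assumptions for
illustration only.

```python
import pandas as pd

# User 3 gave consent but has no answers
consent = pd.DataFrame({'userId': [1, 2, 3], 'consent': [1, 1, 1]})
answers = pd.DataFrame({'userId': [1, 2], 'post_1': [4, 5], 'timeSpent': [30, 42]})

# An outer merge keeps every user that appears in at least one table
merged = pd.merge(consent, answers, on='userId', how='outer')

# Drop all columns whose names start with 'time'
merged = merged.loc[:, ~merged.columns.str.startswith('time')]
print(merged)
```

With `how="outer"`, user 3 stays in the result and gets a missing value for
`post_1`. Keep this in mind during the analysis: participants who did not
finish the study still appear in the final dataframe.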

As a last step, we save the dataframe to a CSV file. Here the file is called
final_df.csv; you can change the name if you want, just make sure you can find
the file after you have saved it. With this code, it is stored in the current
working directory, which is normally the folder that contains this notebook.


In [160]:
merged_df.to_csv('final_df.csv', index=False)
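
If you want to check that saving worked, you can read the file back in and look
at its shape and column names. The sketch below does this with a made-up
dataframe and a temporary folder so that it does not touch your real file; for
your own data you would call `pd.read_csv('final_df.csv')` instead.

```python
import os
import tempfile
import pandas as pd

# Made-up stand-in for merged_df
example_df = pd.DataFrame({'userId': [1, 2], 'post_1': [4, 5]})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'final_df.csv')
    example_df.to_csv(path, index=False)  # index=False: no extra row-number column
    reloaded = pd.read_csv(path)

print(reloaded.shape)
print(list(reloaded.columns))
```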