# Notebook Documentation

## Goal
The purpose of this notebook is to facilitate the upload and processing of background data from various survey providers, including Cint Access, Syno Distribution, Lucid, and PureSpectrum. Additionally, the notebook will handle the import of .sav files containing exported raw data, including non-completes, for analysis.

## Process Overview
1. **Data Upload**: Users will upload background data from the specified providers. This data typically includes respondent IDs, survey completion status, and other metadata.
2. **Data Processing**: The notebook will process the uploaded data to prepare it for analysis. This may involve data cleaning, transformation, and merging of datasets from different sources.
3. **Analysis of .sav Files**: The notebook will import and analyze .sav files, which are data files from SPSS (Statistical Package for the Social Sciences). These files contain survey responses, including those from respondents who did not complete the survey.
4. **Reconciliation**: The final step is to generate an Excel file containing IDs that will be used to reconcile responses with the data collected by the Syno Survey tool.

## Expected Outputs
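- An Excel file named `Reconciliation - <MM_DD_YYYY>.xlsx` with up to four sheets (`PureSpectrum - Add`, `PureSpectrum - Remove`, `Distribution - Add`, `Distribution - Remove`). Each sheet lists the respondent IDs whose status at the provider must be changed so that it matches the survey data.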

## User Instructions
- Users should upload all background data files from the specified providers to the notebook.
- Users should also provide all .sav files from the exported raw data for analysis.
- The notebook will process the data and generate the required Excel file for reconciliation.

In [1]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

import pandas as pd 
import numpy as np 
import pyreadstat

from datetime import datetime
import os

# Load secrets and directory settings (e.g. DATA_DIR) from an .env file; comment out if not using one
from dotenv import load_dotenv
load_dotenv()

True
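
The reader functions in this notebook take their default directories from `os.environ` when they are defined, so a missing variable raises a `KeyError` as soon as the cell runs. A minimal sketch of an up-front check is shown here. The helper name `check_dirs` is hypothetical; the three variable names are the ones the notebook's functions read.

```python
import os

# Environment variables read by the reader functions in this notebook.
REQUIRED_VARS = ("DATA_DIR", "DISTRIBUTION_DIR", "PURESPECTRUM_DIR")


def check_dirs(env=os.environ, required=REQUIRED_VARS):
    """Return a list of human-readable problems; empty if everything is set.

    Hypothetical helper, not part of the original notebook.
    """
    problems = []
    for name in required:
        path = env.get(name)
        if not path:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(path):
            problems.append(f"{name} points to a missing directory: {path}")
    return problems


if __name__ == "__main__":
    for problem in check_dirs():
        print(problem)
```

Running the check right after `load_dotenv()` surfaces configuration problems before any file is read.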

1. Read the data in SPSS format

In [2]:
def read_data(directory = os.environ["DATA_DIR"]):
    """
    Read every SPSS (.sav) file in a folder and stack them into one DataFrame
    
    Params: 
        directory
            - Defaults to the DATA_DIR environment variable. Can be overwritten by another location
    """
    print(f"Reading from directory: {directory}")
    frames = []

    for filename in os.listdir(directory):
        print(f"Reading file: {filename}")
        # os.path.join works whether or not `directory` ends with a slash
        sav, _meta = pyreadstat.read_sav(os.path.join(directory, filename), apply_value_formats=True)
        frames.append(sav)

    # Concatenate once; fall back to an empty frame if the folder is empty
    return pd.concat(frames, axis = 0) if frames else pd.DataFrame()

In [3]:
data = read_data()

Reading from directory: data/
Reading file: Raw Data 156910 2024-02-24.sav
Reading file: Raw Data 174031 2024-02-24.sav
Reading file: Raw Data 222861 2024-02-24.sav
Reading file: Raw Data 665730 2024-02-24.sav
Reading file: Raw Data 818418 2024-02-24.sav
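
`os.listdir` returns every entry in a folder, so a stray file (for example `.DS_Store`) would be handed to the reader along with the real exports. The sketch below shows one way to select files by extension before reading them. The helper name `list_files` is hypothetical and not part of the notebook.

```python
import os


def list_files(directory, extension):
    """Return sorted full paths of the files in `directory` ending in `extension`.

    Hypothetical helper: the extension check is case-insensitive and
    sub-directories are skipped.
    """
    paths = []
    for name in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, name)
        if os.path.isfile(full_path) and name.lower().endswith(extension.lower()):
            paths.append(full_path)
    return paths
```

A call such as `list_files("data/", ".sav")` could then replace the bare `os.listdir(directory)` loop, assuming the exports always carry the `.sav` extension.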


2. Read the distribution files and merge them all

In [4]:
def read_distribution(directory = os.environ["DISTRIBUTION_DIR"]):
    """
    Read every distribution CSV file in a folder and stack them into one DataFrame
    
    Params: 
        directory
            - Defaults to the DISTRIBUTION_DIR environment variable. Can be overwritten by another location
    """
    print(f"Reading from directory: {directory}")

    # Collect one DataFrame per distribution file
    frames = []

    for distribution_file in os.listdir(directory):
        print(f"Reading file: {distribution_file}")
        # os.path.join works whether or not `directory` ends with a slash
        df = pd.read_csv(os.path.join(directory, distribution_file))
        # Drop the auto-named column pandas creates for a header-less column, if present
        df = df.drop(columns="Unnamed: 8", errors="ignore")
        # Use the same ID column name as the survey data
        df = df.rename(columns={"GUID" : "guid"})
        frames.append(df)

    # Concat dataframes by rows; fall back to an empty frame if the folder is empty
    return pd.concat(frames, axis = "index") if frames else pd.DataFrame()

In [5]:
distribution = read_distribution()

Reading from directory: distribution/
Reading file: 4300 Encuesta Satisfacción LATAM sept 2023_2024-02-24.csv
Reading file: P4300 Encuesta Satisfacción Latam - January 2024 - Reinvites_2024-02-24.csv
Reading file: P4300 Encuesta Satisfacción Latam - January 2024_2024-02-24.csv
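
The per-file cleanup can be tried on an in-memory CSV. In this sketch the file layout is an assumption made for illustration: only the `GUID` and `Status` column names come from the notebook, and the trailing comma is one plausible reason an `Unnamed: N` column appears in the real exports.

```python
import io

import pandas as pd

# Toy CSV. The trailing comma on each line creates an extra, empty column,
# which pandas names "Unnamed: 2" because its header is blank.
raw = "GUID,Status,\na1,Complete,\nb2,Screenout,\n"

df = pd.read_csv(io.StringIO(raw))

# Drop every auto-named column instead of hard-coding a single position.
unnamed = [col for col in df.columns if col.startswith("Unnamed:")]
df = df.drop(columns=unnamed).rename(columns={"GUID": "guid"})

print(list(df.columns))  # ['guid', 'Status']
```

Matching on the `Unnamed:` prefix keeps the cleanup working if a provider adds or removes a column and the position of the empty one shifts.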


3. Read the PureSpectrum files and merge them all

In [6]:
def read_purespectrum(directory = os.environ["PURESPECTRUM_DIR"]):
    """
    Read every PureSpectrum CSV file in a folder and stack them into one DataFrame
    
    Params: 
        directory
            - Defaults to the PURESPECTRUM_DIR environment variable. Can be overwritten by another location
    """
    print(f"Reading from directory: {directory}")
    
    # Collect one DataFrame per PureSpectrum file
    frames = []

    for purespectrum_file in os.listdir(directory):
        print(f"Reading file: {purespectrum_file}")
        # os.path.join works whether or not `directory` ends with a slash
        df = pd.read_csv(os.path.join(directory, purespectrum_file))
        # Keep only the columns of interest
        df = df[["Survey ID", "Project Name", "PSID", "Transaction ID", "Survey Country", "Survey Language", "Respondent Status Description", "IP", "UserAgent"]].reset_index(drop=True)
        # Use the same ID column name as the survey data
        df = df.rename(columns={"Transaction ID" : "guid"})
        frames.append(df)

    # Concat dataframes by rows; fall back to an empty frame if the folder is empty
    return pd.concat(frames, axis = "index") if frames else pd.DataFrame()

In [7]:
purespectrum = read_purespectrum()

Reading from directory: purespectrum/
Reading file: survey_All_0.csv




In [8]:
def pureRemoves(pure, data):
    """Rows marked "Complete" in PureSpectrum that have no live survey complete."""
    # Live completes recorded by the survey for the Pure Spectrum source
    survey_completes = data[ (data["source"] == "Pure Spectrum") & (data["status"] == "complete") & (data["mode"] == "live") ]["guid"]
    # PureSpectrum rows that match a live survey complete
    pure_match = pure[pure["guid"].isin(survey_completes)]
    # Respondents to remove in PureSpectrum
    pure_removes = pure[
        (~pure["guid"].isin( pure_match["guid"])) &  # No matching survey complete
        (pure["Respondent Status Description"] == "Complete") # But marked complete in PureSpectrum
    ]

    return pure_removes

In [9]:
def pureAdds(pure, data):
    """Live survey completes that PureSpectrum does not yet mark as "Complete"."""
    # PureSpectrum rows that match a live survey complete
    pure_match = pure[pure["guid"].isin( data[ (data["source"] == "Pure Spectrum") & (data["status"] == "complete") & (data["mode"] == "live") ]["guid"] )]
    # Respondents to add in PureSpectrum
    pure_add = pure_match[pure_match["Respondent Status Description"] != "Complete"]
    
    return pure_add

In [10]:
def distributionRemoves(distribution, data):
    # Distribution rows that match a live survey complete from Cint or Syno
    distribution_match = distribution[distribution["guid"].isin( data[ (data["source"].isin(["Cint", "Syno"])) & (data["status"] == "complete") & (data["mode"] == "live") ]["guid"] )]

    # Respondents to remove in Distribution
    distribution_removes = distribution[
        (~distribution["guid"].isin( distribution_match["guid"] )) &
        (distribution["Status"] == "Complete")
    ]

    return distribution_removes

In [11]:
def distributionAdds(distribution, data):
    # Distribution rows that match a live survey complete from Cint or Syno
    distribution_match = distribution[distribution["guid"].isin( data[ (data["source"].isin(["Cint", "Syno"])) & (data["status"] == "complete") & (data["mode"] == "live") ]["guid"] )]

    # Respondents to add in Distribution
    distribution_add = distribution_match[distribution_match["Status"] != "Complete"]
    
    return distribution_add
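
The four functions above share one pattern: find the live survey completes for a source, then compare them with the provider's own status column. The sketch below folds that pattern into a single helper and runs it on toy data. The name `reconcile` and the sample rows are invented for illustration; the column names and status values are the ones used in the notebook.

```python
import pandas as pd


def reconcile(provider, data, sources, status_col):
    """Return (adds, removes) for one provider export.

    Hypothetical helper, not part of the original notebook.
    adds:    live survey completes the provider does not mark "Complete".
    removes: provider "Complete" rows with no live survey complete.
    """
    completes = data.loc[
        data["source"].isin(sources)
        & (data["status"] == "complete")
        & (data["mode"] == "live"),
        "guid",
    ]
    matched = provider["guid"].isin(completes)
    is_complete = provider[status_col] == "Complete"
    adds = provider[matched & ~is_complete]
    removes = provider[~matched & is_complete]
    return adds, removes


# Toy data, invented for illustration only.
data = pd.DataFrame({
    "guid": ["a", "b", "c"],
    "source": ["Cint", "Syno", "Cint"],
    "status": ["complete", "complete", "screenout"],
    "mode": ["live", "live", "live"],
})
distribution = pd.DataFrame({
    "guid": ["a", "b", "c", "d"],
    "Status": ["Complete", "Screenout", "Complete", "Complete"],
})

adds, removes = reconcile(distribution, data, ["Cint", "Syno"], "Status")
print(adds["guid"].tolist())     # ['b']
print(removes["guid"].tolist())  # ['c', 'd']
```

Respondent `b` completed the survey but is not complete at the provider, so it is an addition. Respondents `c` and `d` are complete at the provider without a matching live survey complete, so they are removals. For PureSpectrum the same call would pass `["Pure Spectrum"]` and `"Respondent Status Description"`.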

In [12]:
def format_sheet(writer, sheet_name, data):
    """Apply Arial 9 pt and a fixed column width to every column of a sheet."""
    workbook = writer.book
    worksheet = writer.sheets[sheet_name]
    # `add_format` is an xlsxwriter API, so the writer must use engine='xlsxwriter'
    cell_format = workbook.add_format({'font_name': 'Arial', 'font_size': 9})
    for col_num in range(len(data.columns)):
        # Set the column width to 12.5 and apply the shared cell format
        worksheet.set_column(col_num, col_num, 12.5, cell_format)

In [13]:
# Get the current date in the format MM_DD_YYYY
current_date = datetime.now().strftime("%m_%d_%Y")

# Use the current date in the filename
filename = f"Reconciliation - {current_date}.xlsx"

with pd.ExcelWriter(filename, engine='xlsxwriter') as writer:  # Specify the engine
    print("Exporting the data to an Excel file")
    
    # PureSpectrum reconciliations
    if len(purespectrum) > 0:
        print("Generating PureSpectrum Additions")
        pure_additions = pureAdds(purespectrum, data)
        pure_additions.to_excel(writer, sheet_name="PureSpectrum - Add", index=False)
        format_sheet(writer, "PureSpectrum - Add", pure_additions)
        
        print("Generating PureSpectrum Removes")
        pure_removes = pureRemoves(purespectrum, data)
        pure_removes.to_excel(writer, sheet_name="PureSpectrum - Remove", index=False)
        format_sheet(writer, "PureSpectrum - Remove", pure_removes)
    
    # Distribution reconciliations
    if len(distribution) > 0:
        print("Generating Distribution Additions")
        distribution_additions = distributionAdds(distribution, data)
        distribution_additions.to_excel(writer, sheet_name="Distribution - Add", index=False)
        format_sheet(writer, "Distribution - Add", distribution_additions)
        
        print("Generating Distribution Removes")
        distribution_removes = distributionRemoves(distribution, data)
        distribution_removes.to_excel(writer, sheet_name="Distribution - Remove", index=False)
        format_sheet(writer, "Distribution - Remove", distribution_removes)
    
    print(f"File exported as {filename}")

Exporting the data to an Excel file
Generating PureSpectrum Additions
Generating PureSpectrum Removes
Generating Distribution Additions
Generating Distribution Removes
File exported as Reconciliation - 02_24_2024.xlsx
