# CiC Research: Raw survey into bronze

This notebook ingests the raw survey data and creates a local SQLite database to store it.

## Research and activities

- Convert Excel to JSON or SQLite. Done: SQLite was chosen because it performs better, can be queried directly, and has a smaller disk footprint.
- Helper methods for SQLite (query and non-query).
- Tried to understand which columns were selected for the 'open-ended' Excel files.
- Most fields in the 'open-ended' Excel files appear to be calculated. For example, there is no `country` column, only longitude and latitude, so we tried a method that translates longitude and latitude into a country.

## Excel / CSV 2 JSON / SQLite

The content provided by the `Ipsos` team is an Excel file. 
We explored both Excel-to-JSON and Excel-to-SQLite conversion and decided to use SQLite, as it has better performance and lets us query the data in a well-known query language (SQL). 

In [None]:
#r "nuget: ClosedXML, 0.104.2"
#r "nuget: Microsoft.Data.Sqlite, 6.0.0"

using System;
using System.IO;
using System.Linq;
using System.Collections.Generic;
using ClosedXML.Excel;
using System.Text.Json;

using Microsoft.Data.Sqlite;

### 📄 Excel Conversion Utilities

This notebook includes two utility methods for converting Excel files into more accessible formats for further analysis:

---

#### ✅ `ConvertExcelToJsonStream(string excelFilePath, string outputJsonPath)`
This method reads the first worksheet of an Excel file and writes its contents to a JSON file as an array of objects.  
- The first row is assumed to be the header and used as JSON property names.  
- Each subsequent row becomes a JSON object, with cell values converted to strings.  
- Rows are streamed to the output file, and the writer is flushed every 100 rows to limit memory use.

**Use case:** Preparing Excel data for JSON-based pipelines, APIs, or lightweight visualization tools.

---

#### ✅ `ConvertExcelToSQLite(string excelFilePath, string sqliteDbPath)`
This method reads the first worksheet of an Excel file and stores its contents in a SQLite database table named `SurveyResponses`.  
- The first row defines the column names (all stored as `TEXT`).  
- Empty values and `#NULL!` strings are normalized to `"N/A!"`.  
- Insertions are wrapped in a transaction for better performance.

**Use case:** Loading structured Excel data into a local SQLite DB for querying, filtering, or joining with other data sources.
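
The normalization rule used by the SQLite conversion can be looked at in isolation. The snippet below is a minimal sketch of that rule only; `NormalizeCell` is a hypothetical helper name that does not exist in the notebook, where the same check is written inline.

```csharp
using System;

// Hypothetical helper (not part of the notebook): maps blank cells and the
// "#NULL!" marker to the "N/A!" placeholder, and leaves other values unchanged.
string NormalizeCell(string raw) =>
    (string.IsNullOrEmpty(raw) || raw == "#NULL!") ? "N/A!" : raw;

Console.WriteLine(NormalizeCell(""));               // N/A!
Console.WriteLine(NormalizeCell("#NULL!"));         // N/A!
Console.WriteLine(NormalizeCell("Very satisfied")); // Very satisfied
```

Because every column is stored as `TEXT`, queries that filter on a value should remember that missing answers appear as the literal string `N/A!` rather than as SQL `NULL`.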

In [None]:
public bool ConvertExcelToJsonStream(string excelFilePath, string outputJsonPath)
{
    try
    {
        using var workbook = new XLWorkbook(excelFilePath);
        var worksheet = workbook.Worksheet(1);

        // Read the header row dynamically
        var headerRow = worksheet.FirstRowUsed();
        var headers = headerRow.CellsUsed().Select(c => c.GetString()).ToList();

        // Open the output file stream
        using var fs = new FileStream(outputJsonPath, FileMode.Create, FileAccess.Write, FileShare.None);
        var jsonWriterOptions = new JsonWriterOptions { Indented = true };
        using var writer = new Utf8JsonWriter(fs, jsonWriterOptions);

        writer.WriteStartArray();
        int rowCount = 0;

        // Process each row after the header
        foreach (var row in worksheet.RowsUsed().Skip(1))
        {
            writer.WriteStartObject();
            int colIndex = 0;
            foreach (var cell in row.Cells(1, headers.Count))
            {
                // Convert each cell's value to string (you could add type checking if needed)
                string value = cell.Value.ToString() ?? "";
                writer.WriteString(headers[colIndex], value);
                colIndex++;
            }
            writer.WriteEndObject();
            rowCount++;

            // Flush periodically (every 100 rows in this example) to reduce memory pressure
            if (rowCount % 100 == 0)
            {
                writer.Flush();
            }
        }

        writer.WriteEndArray();
        writer.Flush();
        return true;
    }
    catch (Exception ex)
    {
        Console.WriteLine($"Error during conversion: {ex.Message}");
        return false;
    }
}

In [None]:
public bool ConvertExcelToSQLite(string excelFilePath, string sqliteDbPath)
{
    try
    {
        // Open the Excel workbook using ClosedXML
        using (var workbook = new XLWorkbook(excelFilePath))
        {
            // Get the first worksheet (adjust if needed)
            var worksheet = workbook.Worksheet(1);

            // Read the header row dynamically
            var headerRow = worksheet.FirstRowUsed();
            var headers = headerRow.CellsUsed().Select(c => c.GetString()).ToList();

            // Open (or create) the SQLite database file
            using (var connection = new SqliteConnection($"Data Source={sqliteDbPath}"))
            {
                connection.Open();

                // Create a table with all columns as TEXT.
                // Use square brackets around column names to handle spaces or special characters.
                var columnsDef = string.Join(", ", headers.Select(h => $"[{h}] TEXT"));
                var createTableSql = $"CREATE TABLE IF NOT EXISTS SurveyResponses ({columnsDef});";
                using (var cmd = new SqliteCommand(createTableSql, connection))
                {
                    cmd.ExecuteNonQuery();
                }

                // Build an INSERT statement with parameters for each column.
                var columnsList = string.Join(", ", headers.Select(h => $"[{h}]"));
                var paramList = string.Join(", ", headers.Select((h, i) => $"@p{i}"));
                var insertSql = $"INSERT INTO SurveyResponses ({columnsList}) VALUES ({paramList});";

                // Wrap the insertion in a transaction for better performance.
                using (var transaction = connection.BeginTransaction())
                using (var insertCmd = new SqliteCommand(insertSql, connection, transaction))
                {
                    // Pre-add the parameters to the command.
                    for (int i = 0; i < headers.Count; i++)
                    {
                        insertCmd.Parameters.Add(new SqliteParameter($"@p{i}", ""));
                    }

                    // Process each row (skip the header row)
                    foreach (var row in worksheet.RowsUsed().Skip(1))
                    {
                        int colIndex = 0;
                        foreach (var cell in row.Cells(1, headers.Count))
                        {
                            // Read the cell value as a string.
                            string value = cell.Value.ToString();

                            // Replace empty strings or "#NULL!" with "N/A!"
                            if (string.IsNullOrEmpty(value) || value == "#NULL!")
                            {
                                value = "N/A!";
                            }
                            insertCmd.Parameters[$"@p{colIndex}"].Value = value;
                            colIndex++;
                        }
                        insertCmd.ExecuteNonQuery();
                    }
                    transaction.Commit();
                }
                connection.Close();
            }
        }
        return true;
    }
    catch (Exception ex)
    {
        Console.WriteLine($"Error during conversion: {ex.Message}");
        return false;
    }
}

### Excel2Json

Use `ConvertExcelToJsonStream` to convert the Excel file to a JSON file. This only needs to be executed once.


In [None]:
var excelFilePath = "FY25H1 Raw Data with labels US and UK.xlsx";
var outputJsonPath = "fy25-raw.json";

if ( ConvertExcelToJsonStream(excelFilePath, outputJsonPath))
{
    Console.WriteLine($"Excel file '{excelFilePath}' was successfully converted to JSON and saved to '{outputJsonPath}'");
}
else
{
    Console.WriteLine($"Failed to convert '{excelFilePath}' to JSON");
}

### Excel2Sqlite

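Use `ConvertExcelToSQLite` to convert the Excel file to a SQLite database file. Run it once: the table is created with `CREATE TABLE IF NOT EXISTS`, so re-running against the same database file appends the rows again.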

In [None]:
var excelFilePath = "FY25H1 Raw Data with labels US and UK.xlsx";
var sqliteDbPath = "fy25-raw.db";

if (ConvertExcelToSQLite(excelFilePath, sqliteDbPath))
{
    Console.WriteLine($"Excel file '{excelFilePath}' was successfully converted to SQLite and saved to '{sqliteDbPath}'");
}
else
{
    Console.WriteLine($"Failed to convert '{excelFilePath}' to SQLite");
}

## SQLiteHelper Overview

- **LoadDatabase**: Opens a connection to the specified SQLite file (or in-memory DB).
- **ExecuteNonQuery**: Runs SQL statements that don't return rows (e.g., CREATE, INSERT, UPDATE, DELETE).
- **ExecuteQuery**: Runs a SQL query and returns each row as a dictionary of column-value pairs.
- **CreateTableOrView**: Creates a new table or view using a given SQL script.
- **DropTableOrView**: Drops an existing table or view if it exists.
- **UpdateTableRow**: Updates a single row in a table by matching a key column and setting new column values.
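
To make the `UpdateTableRow` behavior concrete, here is a small sketch of how a parameterized `UPDATE` statement of that shape can be composed. `BuildUpdateSql` is a hypothetical name used only for illustration; it builds the SQL text and does not touch a database.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (illustration only): composes the SQL text for a
// single-row update. Values are not embedded; they are bound later as
// parameters named @p0, @p1, ... and @keyValue.
string BuildUpdateSql(string tableName, string keyColumn, IEnumerable<string> columns)
{
    // One "[column] = @pN" pair per updated column.
    var setClause = string.Join(", ", columns.Select((name, i) => $"[{name}] = @p{i}"));
    return $"UPDATE [{tableName}] SET {setClause} WHERE [{keyColumn}] = @keyValue;";
}

Console.WriteLine(BuildUpdateSql("Locations", "ResponseId", new[] { "Country" }));
// UPDATE [Locations] SET [Country] = @p0 WHERE [ResponseId] = @keyValue;
```

Only the values are parameterized. Table and column names are interpolated into the statement, so they should come from trusted code rather than from user input.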

In [None]:
// using System;
// using System.Collections.Generic;
// using Microsoft.Data.Sqlite;

/// <summary>
/// Helper class for working with SQLite databases.
/// </summary>
public static class SQLiteHelper
{
    /// <summary>
    /// Loads a SQLite database from a file and returns an open connection.
    /// </summary>
    /// <param name="sqliteDbPath">Path to the SQLite database file.</param>
    /// <returns>An open <see cref="SqliteConnection"/>.</returns>
    public static SqliteConnection LoadDatabase(string sqliteDbPath)
    {
        var connection = new SqliteConnection($"Data Source={sqliteDbPath}");
        connection.Open();
        return connection;
    }

    /// <summary>
    /// Executes a non-query SQL statement (e.g., INSERT, UPDATE, DELETE, or DDL) against the database.
    /// </summary>
    /// <param name="connection">An open <see cref="SqliteConnection"/>.</param>
    /// <param name="sql">The SQL statement to execute.</param>
    public static void ExecuteNonQuery(SqliteConnection connection, string sql)
    {
        using var command = connection.CreateCommand();
        command.CommandText = sql;
        command.ExecuteNonQuery();
    }

    /// <summary>
    /// Executes a SQL query that returns rows and returns the results as a list of dictionaries.
    /// Each dictionary represents a row with column names as keys.
    /// </summary>
    /// <param name="connection">An open <see cref="SqliteConnection"/>.</param>
    /// <param name="sql">The SQL query to execute.</param>
    /// <returns>A list of dictionaries representing the query result rows.</returns>
    public static List<Dictionary<string, object>> ExecuteQuery(SqliteConnection connection, string sql)
    {
        var results = new List<Dictionary<string, object>>();
        using var command = connection.CreateCommand();
        command.CommandText = sql;
        using var reader = command.ExecuteReader();
        while (reader.Read())
        {
            var row = new Dictionary<string, object>();
            for (int i = 0; i < reader.FieldCount; i++)
            {
                // If the field is DBNull, assign null.
                row[reader.GetName(i)] = reader.IsDBNull(i) ? null : reader.GetValue(i);
            }
            results.Add(row);
        }
        return results;
    }

    /// <summary>
    /// Executes a SQL statement that creates a new table or view.
    /// </summary>
    /// <param name="connection">An open SQLite connection.</param>
    /// <param name="createQuery">The CREATE TABLE or CREATE VIEW SQL statement.</param>
    public static void CreateTableOrView(SqliteConnection connection, string createQuery)
    {
        using var command = connection.CreateCommand();
        command.CommandText = createQuery;
        command.ExecuteNonQuery();
    }

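    /// <summary>
    /// Drops a table or view if it exists.
    /// </summary>
    /// <param name="connection">An open <see cref="SqliteConnection"/>.</param>
    /// <param name="objectName">Name of the table or view to drop.</param>
    /// <param name="objectType">Either "table" (default) or "view".</param>
    public static void DropTableOrView(SqliteConnection connection, string objectName, string objectType = "table")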
    {
        string sql;
        if (objectType.Equals("table", StringComparison.OrdinalIgnoreCase))
        {
            sql = $"DROP TABLE IF EXISTS [{objectName}];";
        }
        else if (objectType.Equals("view", StringComparison.OrdinalIgnoreCase))
        {
            sql = $"DROP VIEW IF EXISTS [{objectName}];";
        }
        else
        {
            throw new ArgumentException("objectType must be either 'table' or 'view'");
        }
        
        using var command = connection.CreateCommand();
        command.CommandText = sql;
        command.ExecuteNonQuery();
    }

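    /// <summary>
    /// Updates the row(s) whose key column matches the given value.
    /// </summary>
    /// <param name="connection">An open <see cref="SqliteConnection"/>.</param>
    /// <param name="tableName">Name of the table to update.</param>
    /// <param name="keyColumn">Column used to locate the row.</param>
    /// <param name="keyValue">Value the key column must match.</param>
    /// <param name="updatedValues">Column names mapped to their new values.</param>
    public static void UpdateTableRow(
        SqliteConnection connection,
        string tableName,
        string keyColumn,
        object keyValue,
        Dictionary<string, object> updatedValues)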
    {
        // Build the SET clause from the dictionary.
        var setClauses = updatedValues.Select((kv, i) => $"[{kv.Key}] = @p{i}").ToList();
        string setClause = string.Join(", ", setClauses);
        string sql = $"UPDATE [{tableName}] SET {setClause} WHERE [{keyColumn}] = @keyValue;";
        
        using var command = connection.CreateCommand();
        command.CommandText = sql;
        
        // Add parameters for updated columns.
        int index = 0;
        foreach (var kv in updatedValues)
        {
            command.Parameters.AddWithValue($"@p{index}", kv.Value);
            index++;
        }
        // Add the key value parameter.
        command.Parameters.AddWithValue("@keyValue", keyValue);
        
        command.ExecuteNonQuery();
    }
}

### GetCountryFromCoordinatesAsync

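Translates a latitude and longitude into a country. The method returns the country code from the reverse-geocoding response, or `N/A!` when the lookup fails.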
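
The parsing step can be tried without calling the service. The sketch below runs the same `address` / `country_code` lookup against a hand-written payload. The JSON is an abbreviated, made-up sample in the shape this notebook expects from the reverse endpoint, and `ExtractCountryCode` is a hypothetical helper name.

```csharp
using System;
using System.Text.Json;

// Made-up, abbreviated sample payload (an assumption, not a captured response).
string sampleJson = "{\"address\":{\"country\":\"United States\",\"country_code\":\"us\"}}";

// Hypothetical helper: returns address.country_code, or "N/A!" when it is absent.
string ExtractCountryCode(string json)
{
    using JsonDocument doc = JsonDocument.Parse(json);
    if (doc.RootElement.TryGetProperty("address", out JsonElement address) &&
        address.TryGetProperty("country_code", out JsonElement code))
    {
        return code.GetString() ?? "N/A!";
    }
    return "N/A!";
}

Console.WriteLine(ExtractCountryCode(sampleJson));                    // us
Console.WriteLine(ExtractCountryCode("{\"error\":\"not found\"}"));   // N/A!
```

Returning the same `N/A!` placeholder used for missing survey values keeps downstream filters consistent.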

In [None]:
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

public async Task<string> GetCountryFromCoordinatesAsync(double latitude, double longitude)
{
    try
    {
        using HttpClient client = new HttpClient();
        client.DefaultRequestHeaders.UserAgent.ParseAdd("MyAppName/1.0 (yoavdo@gmail.com)");
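        // This URL is an example using the free OpenStreetMap Nominatim service.
        // Check usage policies and consider caching for production.
        // Format with the invariant culture so the coordinates always use '.' as the decimal separator.
        string url = FormattableString.Invariant($"https://nominatim.openstreetmap.org/reverse?format=jsonv2&lat={latitude}&lon={longitude}");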
        HttpResponseMessage response = await client.GetAsync(url);
        response.EnsureSuccessStatusCode();
        string jsonResponse = await response.Content.ReadAsStringAsync();
        using JsonDocument doc = JsonDocument.Parse(jsonResponse);
        if (doc.RootElement.TryGetProperty("address", out JsonElement address))
        {
            // Console.WriteLine(address);
            if (address.TryGetProperty("country_code", out JsonElement country_code))
            {
                return country_code.GetString();
            }
        }
    }
    catch (Exception ex)
    {
        Console.WriteLine($"Error reverse geocoding: {ex.Message}");
    }
    return "N/A!";
}

In [None]:
var location = await GetCountryFromCoordinatesAsync(37.7749, -122.4194);
Console.WriteLine(location);

In [None]:
var connection = SQLiteHelper.LoadDatabase("fy25-raw.db");
string query = "SELECT * from SurveyResponses WHERE ResponseId = 'R_105HrhmcqZ4Edep';";
// string selectQuery = "SELECT * FROM where Country = 'United States';";
var results = SQLiteHelper.ExecuteQuery(connection, query);
// Print every field of each matching row

foreach (var row in results)
{
    foreach (var kvp in row)
    {
        Console.WriteLine($"{kvp.Key}: {kvp.Value}");
    }
    Console.WriteLine();
}



## Query the data

In [None]:
var dbhelper = SQLiteHelper.LoadDatabase("fy25-raw.db");
var query = @"select count(*) from SurveyResponses
              where Q026a_8 = 'Very satisfied';";
var results = SQLiteHelper.ExecuteQuery(dbhelper, query);
foreach (var row in results)
{
    Console.WriteLine(string.Join(", ", row.Select(kvp => $"{kvp.Key}={kvp.Value}")));
}

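// Note: MyView is not created anywhere in this notebook; this query assumes the view already exists.
query = @"select count(*) from MyView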
         where Q026_19 = 'Very satisfied';";
results = SQLiteHelper.ExecuteQuery(dbhelper, query);
foreach (var row in results)
{
    Console.WriteLine(string.Join(", ", row.Select(kvp => $"{kvp.Key}={kvp.Value}")));
}

## Data manipulation & exploration

A few of the fields are calculated, and this section shows how such a field can be derived. A new table keyed by `ResponseId` holds the derived values so they can later be joined back to the survey data.
The same approach works for other calculations. 
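
Since the geocoding service may throttle requests, one option is to cache results so that nearby coordinates trigger a single lookup. The sketch below is only an illustration of that idea: `GetCountryCachedAsync` and `FakeLookupAsync` are hypothetical names, the fake lookup stands in for the real service call, and rounding to two decimals (roughly 1 km) is an assumption that can misclassify points close to a border.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Threading.Tasks;

var cache = new Dictionary<string, string>();
int lookups = 0;

// Stand-in for the real reverse-geocoding call; always answers "us" for this demo.
Task<string> FakeLookupAsync(double lat, double lon)
{
    lookups++;
    return Task.FromResult("us");
}

// Hypothetical wrapper: consults the cache before delegating to the lookup.
async Task<string> GetCountryCachedAsync(double lat, double lon, Func<double, double, Task<string>> lookup)
{
    // Key on coordinates rounded to 2 decimals so nearby respondents share an entry.
    string key = string.Format(CultureInfo.InvariantCulture, "{0:F2},{1:F2}", lat, lon);
    if (!cache.TryGetValue(key, out string country))
    {
        country = await lookup(lat, lon);
        cache[key] = country;
    }
    return country;
}

Console.WriteLine(await GetCountryCachedAsync(37.7749, -122.4194, FakeLookupAsync)); // us
Console.WriteLine(await GetCountryCachedAsync(37.7741, -122.4189, FakeLookupAsync)); // us
Console.WriteLine($"Service calls: {lookups}");                                     // Service calls: 1
```

In the notebook, the real lookup method could be passed in place of the fake one. The cache lives only for the session, so it reduces repeated calls within a run but does not persist between runs.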

In [None]:
// Create the Locations table, then reverse-geocode up to 10 rows that have no country yet
var connection = SQLiteHelper.LoadDatabase("../fy25-raw.db");

var tableName  = "Locations";

string createTableQuery = $@"
    CREATE TABLE IF NOT EXISTS {tableName} AS
    SELECT ResponseId, 
           LocationLatitude, 
           LocationLongitude, 
           '' AS Country
    FROM SurveyResponses;
";
SQLiteHelper.ExecuteNonQuery(connection, createTableQuery);

// update only 10 items (the api might throttle us)
string selectQuery = $@"
    SELECT * FROM {tableName} 
    WHERE Country = ''
    LIMIT 10;";
var results = SQLiteHelper.ExecuteQuery(connection, selectQuery);

foreach (var row in results)
{
    // Coordinates are stored as TEXT and may hold the "N/A!" placeholder, so parse defensively.
    string country = "N/A!";
    if (double.TryParse(Convert.ToString(row["LocationLatitude"]), out double latitude) &&
        double.TryParse(Convert.ToString(row["LocationLongitude"]), out double longitude))
    {
        Console.WriteLine($"Latitude: {latitude}, Longitude: {longitude}");
        country = (await GetCountryFromCoordinatesAsync(latitude, longitude)).ToUpper();
    }

    // Update the in-memory dictionary (for display, if needed)
    row["Country"] = country;
    Console.WriteLine($"Country: {country}");

    // Update the corresponding row in the database using ResponseId as the key.
    var updatedValues = new Dictionary<string, object>
    {
        { "Country", country }
    };
    SQLiteHelper.UpdateTableRow(connection, tableName, "ResponseId", row["ResponseId"], updatedValues);
}

Console.WriteLine("Rows updated successfully.");

In [None]:
var connection = SQLiteHelper.LoadDatabase("../fy25-raw.db");

var tableName  = "Locations";

string selectQuery = $@"
    SELECT count(*) FROM {tableName} 
    WHERE Country != '';";
    
var results = SQLiteHelper.ExecuteQuery(connection, selectQuery);

// count(*) returns a single row holding the number of locations that already have a country
foreach (var row in results)
{
    Console.WriteLine(string.Join(", ", row.Select(kvp => $"{kvp.Key}={kvp.Value}")));
}

## Mapping raw data to deep-dive data

This is an initial take on the mapping between the raw fields and the deep-dive area. The matches were proposed with o1, without discussing the deep-dive area with the person who designed it, so they should be treated as tentative. To make progress, we will pick specific columns to demonstrate how the data can be enriched.

This would be part of the silver notebook.



| **Target Field**                                        | **Likely Source Field(s)**                                                | **Operation or Mapping Logic**                                                                                                                      |
|:--------------------------------------------------------|:--------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------|
| **Country**                                             | `LocationLatitude`, `LocationLongitude`                                   | **Reverse Geocoding**<br/>Use latitude & longitude to determine the respondent’s country.                                                           |
| **ResponseId**                                          | `ResponseId`                                                             | **Direct Copy**<br/>Map source `ResponseId` to target.                                                                                              |
| **CSAT_Azure**                                          | Possibly in `Q011_1` (or similar)                                         | **Parsing Satisfaction for Azure**<br/>Identify which source column has Azure satisfaction (“Very satisfied,” etc.) and map directly.               |
| **CSAT_AWS**                                            | Possibly in `Q011_2` (or similar)                                         | **Parsing Satisfaction for AWS**<br/>Identify which source column has AWS satisfaction (“Very satisfied,” etc.) and map directly.                   |
| **CSAT_GCP**                                            | Possibly in `Q011_3` (or similar)                                         | **Parsing Satisfaction for GCP**<br/>Identify which source column has GCP satisfaction (“Very satisfied,” etc.) and map directly.                   |
| **USE PARTNER**                                         | Look for fields/text referencing partner usage (e.g., “Uses a Partner”).  | **Boolean / Text Check**<br/>If question indicates “Uses a Partner” vs. “Does not use a Partner,” map accordingly.                                   |
| **Partner_Brand**                                       | Same partner usage question(s)                                            | **Conditional**<br/>If “Uses a Partner,” parse which brand is used (e.g., “Partner - AWS,” “Partner - Azure,” etc.).                                |
| **NPS_Azure** (0–10 rating)                             | Possibly columns with “10 - Extremely likely…” or numeric scale for Azure | **Numeric/Verbatim Copy**<br/>Find the Azure recommend-likelihood question/column (often a 0–10 scale).                                              |
| **NPS_AWS**                                             | Same logic for AWS                                                        | **Numeric/Verbatim Copy**                                                                                                                           |
| **NPS_GCP**                                             | Same logic for GCP                                                        | **Numeric/Verbatim Copy**                                                                                                                           |
| **Operating System**                                    | Possibly columns for Windows vs. Linux usage                              | **Categorical Mapping**<br/>Map from OS usage responses (e.g., “Windows only,” “Linux only,” “Mixed”).                                              |
| **Servers_New_Azure (>60% usage)**                     | Detailed usage columns (e.g., Azure usage %)                               | **Percent / Usage Calculation**<br/>Check if usage percentage for Azure > 60% and categorize.                                                        |
| **Servers_New_Azure (>80% usage)**                     | Same as above, but check for >80% usage                                    | **Percent / Usage Calculation**<br/>If usage > 80%, map as “Azure - Linux Heavy Users(>80%)” (or similarly “Windows Heavy,” etc.).                  |
| **Servers_New_AWS (>60% usage)**                       | Detailed usage columns (AWS)                                              | **Percent / Usage Calculation**                                                                                                                     |
| **Servers_New_AWS (>80% usage)**                       | Detailed usage columns (AWS)                                              | **Percent / Usage Calculation**                                                                                                                     |
| **Servers_New_GCP (>60% usage)**                       | Detailed usage columns (GCP)                                              | **Percent / Usage Calculation**                                                                                                                     |
| **Servers_New_GCP (>80% usage)**                       | Detailed usage columns (GCP)                                              | **Percent / Usage Calculation**                                                                                                                     |
| **Tenure with Cloud Option 3**                         | Possibly columns that say “Less than 3 months,” “1–2 years,” etc.         | **Categorical Copy**<br/>Map the overall cloud usage tenure to the target’s text format (“6+ years,” “2–3 years,” etc.).                           |
| **Tenure with Azure Option 3**                         | Brand-specific tenure columns for Azure                                   | **Categorical Copy**<br/>Map brand-specific usage tenure text.                                                                                      |
| **Tenure with AWS Option 3**                           | Brand-specific tenure columns for AWS                                     | **Categorical Copy**                                                                                                                               |
| **Tenure with GCP Option 3**                           | Brand-specific tenure columns for GCP                                     | **Categorical Copy**                                                                                                                               |
| **Cloud Usage**                                        | Columns indicating multi or single cloud usage                            | **Logic Based on Selection**<br/>If more than one cloud is “Currently use,” it’s “Multicloud”; otherwise “Single Cloud.”                            |
| **Multi-cloud Users**                                  | Same as above                                                             | **Boolean Check**<br/>If using multiple clouds, “Multi-cloud” = True.                                                                              |
| **Segment**                                            | Possibly a question about SMB vs. ENT vs. Mid-market                      | **Categorical Copy**<br/>Map directly from the segment question.                                                                                    |
| **Org Size (R+I split)**                               | A question or column referencing total employees                          | **Bucket by Employee Count**<br/>Map numeric size ranges (e.g., 1–24, 25–249, 1000+) to labels.                                                     |
| **Industry**                                           | Possibly in `Q005` or “Which industry best fits?”                         | **Categorical Copy**<br/>Map to “IT & Other,” “Education,” etc.                                                                                    |
| **Role**                                               | Columns about job role (e.g., “Developer,” “IT Pro,” etc.)               | **Categorical Copy**<br/>Based on job-role question.                                                                                               |
| **Cloud Native (Q101b, Q101a)**                        | Possibly 1–2 columns about “Cloud Native approach”                        | **Derived**<br/>If certain conditions met (e.g., both Q101a and Q101b = “Yes”), mark “Cloud Native.”                                               |
| **Startup (Q102a)**                                    | A question: “Is your company a startup?”                                  | **Boolean**                                                                                                                                        |
| **Q089a_2 (ISV)**                                      | Question about ISV status                                                | **Boolean**                                                                                                                                        |
| **Q048b (Customer Support plan)**                      | Columns referencing “Business Support,” “Standard Support,” etc.         | **Categorical Copy**                                                                                                                               |
| **BrandAssigned**                                      | Might be logic about the respondent’s assigned brand                     | **Conditional**<br/>Based on screening/branching logic (e.g., “AWS,” “Azure,” or “Google Cloud”).                                                  |
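
As an example of the "Logic Based on Selection" rows, the **Cloud Usage** rule from the table can be sketched as follows. `ClassifyCloudUsage` is a hypothetical helper, and the dictionary keys and the `Never used` label are assumptions, since the actual source columns and answer labels have not been confirmed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: takes one usage answer per cloud and applies the table's rule.
// More than one "Currently use" answer means "Multicloud"; anything else is "Single Cloud".
string ClassifyCloudUsage(IReadOnlyDictionary<string, string> usageByCloud)
{
    int inUse = usageByCloud.Values.Count(v => v == "Currently use");
    return inUse > 1 ? "Multicloud" : "Single Cloud";
}

var respondent = new Dictionary<string, string>
{
    { "Azure", "Currently use" },
    { "AWS", "Currently use" },
    { "GCP", "Never used" } // assumed label, for illustration only
};

Console.WriteLine(ClassifyCloudUsage(respondent)); // Multicloud
```

Following the rule as written, a respondent with no cloud marked "Currently use" is also labelled "Single Cloud"; whether that case needs its own label is a question for the deep-dive owner.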