
##Case study 2 solution

We’ve been asked to analyze the online ad-click data collected by our buddy Fred.
His advertising data table monitors ad clicks across 30 different colors. Our aim is
to discover an ad color that generates significantly more clicks than blue.

We will do so by following these steps:
* Load and clean our advertising data using Pandas.
* Run a permutation test between blue and the other recorded colors.
* Check the computed p-values for statistical significance using a properly
determined significance level.
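The last two steps can be sketched on made-up numbers before touching the real table. The block below is a minimal illustration, not the notebook's final implementation: the function name `permutation_test`, the two-sided test statistic, the 10,000 permutations, the fixed seed, and every click count are our own assumptions. The Bonferroni correction shown at the end is one common way to pick a significance level when 29 colors (30 minus blue) are each compared against blue.

```python
import numpy as np

def permutation_test(sample_a, sample_b, num_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in sample means."""
    rng = np.random.default_rng(seed)
    sample_a = np.asarray(sample_a, dtype=float)
    sample_b = np.asarray(sample_b, dtype=float)
    observed = abs(sample_a.mean() - sample_b.mean())
    pooled = np.concatenate([sample_a, sample_b])
    extreme_count = 0
    for _ in range(num_permutations):
        # Shuffle the pooled data in place, then split it back into two groups
        rng.shuffle(pooled)
        permuted_diff = abs(pooled[:sample_a.size].mean()
                            - pooled[sample_a.size:].mean())
        if permuted_diff >= observed:
            extreme_count += 1
    return extreme_count / num_permutations

# Hypothetical daily click counts (not taken from Fred's table)
blue_clicks = [28, 30, 25, 27, 31, 29, 26, 28, 30, 27]
similar_clicks = [29, 27, 26, 30, 28, 31, 25, 29, 27, 28]
lower_clicks = [15, 17, 14, 16, 18, 15, 16, 17, 14, 15]

p_similar = permutation_test(blue_clicks, similar_clicks)
p_lower = permutation_test(blue_clicks, lower_clicks)

# Assumed Bonferroni correction: 0.05 split across 29 blue-vs-color comparisons
significance_level = 0.05 / 29
print(f"p-value (similar color): {p_similar}")
print(f"p-value (lower color): {p_lower}")
print(f"Corrected significance level: {significance_level:.5f}")
```

A p-value below the corrected level would only say that a color differs from blue. The sign of the difference in means still has to be checked to know whether it does better or worse.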

##Setup

In [2]:
from collections import defaultdict
import itertools

import numpy as np
import pandas as pd
from scipy import stats
import math

import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
!wget https://github.com/rahiakela/data-science-research-and-practice/raw/main/data-science-bookcamp/case-study-2--online-ad-clicks/colored_ad_click_table.csv

In [3]:
# Squared distance between two values
def squared_distance(value1, value2):
  return (value1 - value2) ** 2

In [4]:
# Penalizing centers using the total sum of squared distances
def sum_of_squared_distances(value, measurements):
  return sum(squared_distance(value, m) for m in measurements)

In [5]:
# Computing the sum of squared distances from the mean
def sum_of_squares(data):
  mean = np.mean(data)
  return sum(squared_distance(value, mean) for value in data)

In [6]:
# Computing the variance from mean squared distance
def variance(data):
  mean = np.mean(data)
  return np.mean([squared_distance(value, mean) for value in data])

In [7]:
# Computing the weighted variance using np.average
def weighted_variance(data, weights):
  mean = np.average(data, weights=weights)
  squared_distances = [squared_distance(value, mean) for value in data]
  return np.average(squared_distances, weights=weights)

# Sanity check: weights of 9 and 1 should match a list holding 75 nine times and 77 once
assert np.isclose(weighted_variance([75, 77], [9, 1]), np.var(9 * [75] + [77]))
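The list-comprehension helpers above can also be written with NumPy array arithmetic. The following is a sketch of an equivalent weighted variance. The name `weighted_variance_vectorized` is ours, and the check reuses the same two measurements and weights as the assertion above.

```python
import numpy as np

def weighted_variance_vectorized(data, weights):
    """Weighted variance computed with array operations instead of a loop."""
    data = np.asarray(data, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mean = np.average(data, weights=weights)
    # Weighted mean of the squared distances from the weighted mean
    return np.average((data - mean) ** 2, weights=weights)

result = weighted_variance_vectorized([75, 77], [9, 1])
# np.isclose avoids relying on exact floating-point equality
assert np.isclose(result, np.var(9 * [75] + [77]))
```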

##Loading dataset

Let’s begin by loading our ad-click table into Pandas.

In [8]:
# Loading the ad-click table into Pandas
df = pd.read_csv("colored_ad_click_table.csv")
print(f"Table contains {df.shape[0]} rows and {df.shape[1]} columns")

Table contains 30 rows and 41 columns


In [9]:
# Checking the column names
print(df.columns)

Index(['Color', 'Click Count: Day 1', 'View Count: Day 1',
       'Click Count: Day 2', 'View Count: Day 2', 'Click Count: Day 3',
       'View Count: Day 3', 'Click Count: Day 4', 'View Count: Day 4',
       'Click Count: Day 5', 'View Count: Day 5', 'Click Count: Day 6',
       'View Count: Day 6', 'Click Count: Day 7', 'View Count: Day 7',
       'Click Count: Day 8', 'View Count: Day 8', 'Click Count: Day 9',
       'View Count: Day 9', 'Click Count: Day 10', 'View Count: Day 10',
       'Click Count: Day 11', 'View Count: Day 11', 'Click Count: Day 12',
       'View Count: Day 12', 'Click Count: Day 13', 'View Count: Day 13',
       'Click Count: Day 14', 'View Count: Day 14', 'Click Count: Day 15',
       'View Count: Day 15', 'Click Count: Day 16', 'View Count: Day 16',
       'Click Count: Day 17', 'View Count: Day 17', 'Click Count: Day 18',
       'View Count: Day 18', 'Click Count: Day 19', 'View Count: Day 19',
       'Click Count: Day 20', 'View Count: Day 20'],
      dtype='object')

In [10]:
# Checking the color names
print(df.Color.values)

['Pink' 'Gray' 'Sapphire' 'Purple' 'Coral' 'Olive' 'Navy' 'Maroon' 'Teal'
 'Cyan' 'Orange' 'Black' 'Tan' 'Red' 'Blue' 'Brown' 'Turquoise' 'Indigo'
 'Gold' 'Jade' 'Ultramarine' 'Yellow' 'Virdian' 'Violet' 'Green'
 'Aquamarine' 'Magenta' 'Silver' 'Bronze' 'Lime']


In [11]:
# Checking for blue
assert "Blue" in df.Color.values

In [12]:
# Summarizing day 1 of the experiment
selected_columns = ["Color", "Click Count: Day 1", "View Count: Day 1"]
print(df[selected_columns].describe())

       Click Count: Day 1  View Count: Day 1
count           30.000000               30.0
mean            23.533333              100.0
std              7.454382                0.0
min             12.000000              100.0
25%             19.250000              100.0
50%             24.000000              100.0
75%             26.750000              100.0
max             49.000000              100.0


In [13]:
# Summarizing day 2 of the experiment
selected_columns = ["Color", "Click Count: Day 2", "View Count: Day 2"]
print(df[selected_columns].describe())

       Click Count: Day 2  View Count: Day 2
count           30.000000               30.0
mean            24.433333              100.0
std              5.864465                0.0
min             15.000000              100.0
25%             21.000000              100.0
50%             24.000000              100.0
75%             28.000000              100.0
max             41.000000              100.0


In [14]:
# Confirming equivalent daily views
view_columns = [col for col in df.columns if "View" in col]
assert np.all(df[view_columns].values == 100)
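Because every ad received exactly 100 views per day, raw click counts are directly comparable across colors and days. If the view counts had differed, we would probably want to compare click rates instead. The sketch below shows that conversion on a tiny made-up table. The name `toy`, the `Click Rate` column labels, and all values are hypothetical.

```python
import pandas as pd

# Hypothetical table with unequal view counts
toy = pd.DataFrame({
    "Color": ["Blue", "Red"],
    "Click Count: Day 1": [30, 20],
    "View Count: Day 1": [100, 50],
    "Click Count: Day 2": [25, 30],
    "View Count: Day 2": [100, 120],
}).set_index("Color")

click_columns = [col for col in toy.columns if col.startswith("Click Count")]
# Pair each click column with the view column for the same day
paired_view_columns = [col.replace("Click Count", "View Count")
                       for col in click_columns]
rate_columns = [col.replace("Click Count", "Click Rate")
                for col in click_columns]

# Element-wise division of clicks by views gives the per-day click rate
rates = pd.DataFrame(
    toy[click_columns].values / toy[paired_view_columns].values,
    index=toy.index,
    columns=rate_columns,
)
print(rates)
```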

In [15]:
# Deleting view counts from the table
df.drop(columns=view_columns, inplace=True)
print(df.columns)

Index(['Color', 'Click Count: Day 1', 'Click Count: Day 2',
       'Click Count: Day 3', 'Click Count: Day 4', 'Click Count: Day 5',
       'Click Count: Day 6', 'Click Count: Day 7', 'Click Count: Day 8',
       'Click Count: Day 9', 'Click Count: Day 10', 'Click Count: Day 11',
       'Click Count: Day 12', 'Click Count: Day 13', 'Click Count: Day 14',
       'Click Count: Day 15', 'Click Count: Day 16', 'Click Count: Day 17',
       'Click Count: Day 18', 'Click Count: Day 19', 'Click Count: Day 20'],
      dtype='object')


In [16]:
# Summarizing daily blue-click statistics
df.set_index("Color", inplace=True)
print(df.T.Blue.describe())

count    20.000000
mean     28.350000
std       5.499043
min      18.000000
25%      25.750000
50%      27.500000
75%      30.250000
max      42.000000
Name: Blue, dtype: float64
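With colors as the index, each color's mean daily clicks can be compared against blue's in a few lines. The sketch below uses a made-up three-color table. The name `toy` and all click counts are hypothetical, and the same pattern should carry over to the full table.

```python
import pandas as pd

# Hypothetical click counts, indexed by color like the table above
toy = pd.DataFrame(
    {
        "Click Count: Day 1": [28, 35, 20],
        "Click Count: Day 2": [30, 33, 22],
        "Click Count: Day 3": [26, 37, 21],
    },
    index=pd.Index(["Blue", "Red", "Gray"], name="Color"),
)

# Mean clicks per color, taken across the day columns
mean_clicks = toy.mean(axis=1)

# Difference between each color's mean and blue's mean, largest first
diff_from_blue = (mean_clicks - mean_clicks["Blue"]).drop("Blue")
diff_from_blue = diff_from_blue.sort_values(ascending=False)
print(diff_from_blue)
```

A positive difference marks a color that outperformed blue on average. Whether that gap is statistically meaningful is a separate question that the p-value has to answer.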


##Computing p-values from differences in means

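Before computing p-values, let’s briefly review how individual Pandas columns are
retrieved by name. The unexecuted cells below illustrate this on a small
fish-measurement table with `Fish`, `Length`, and `Width` columns, not on the ad-click table.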
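The outputs printed in the cells that follow come from a four-row table with `Fish`, `Length`, and `Width` columns. A minimal sketch that rebuilds that table from the printed values is given here. The name `fish_df` is our own choice, used so that the ad-click `df` stays intact.

```python
import pandas as pd

# Values copied from the printed outputs in the cells below
fish_df = pd.DataFrame({
    "Fish": ["Angelfish", "Zebrafish", "Killifish", "Swordtail"],
    "Length": [15.2, 6.5, 9.0, 6.0],
    "Width": [7.7, 2.1, 4.5, 2.0],
})

# The same operations shown below, applied to the rebuilt table
print(fish_df.columns)
print(fish_df.sort_values("Length"))
print(fish_df[fish_df.Width >= 3])
```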

In [None]:
# Accessing all column names
print(df.columns)

Index(['Fish', 'Length', 'Width'], dtype='object')


In [None]:
# Accessing an individual column
print(df.Fish)

0    Angelfish
1    Zebrafish
2    Killifish
3    Swordtail
Name: Fish, dtype: object


In [None]:
print(df.Length)

0    15.2
1     6.5
2     9.0
3     6.0
Name: Length, dtype: float64


In [None]:
# Retrieving a column as a NumPy array
print(df.Fish.values)

['Angelfish' 'Zebrafish' 'Killifish' 'Swordtail']


In [None]:
# Accessing a column using brackets
print(df["Fish"])

0    Angelfish
1    Zebrafish
2    Killifish
3    Swordtail
Name: Fish, dtype: object


In [None]:
# Accessing multiple columns using brackets
print(df[["Fish", "Length"]])

        Fish  Length
0  Angelfish    15.2
1  Zebrafish     6.5
2  Killifish     9.0
3  Swordtail     6.0


In [None]:
# Sorting rows by column value
print(df.sort_values("Length"))

        Fish  Length  Width
3  Swordtail     6.0    2.0
1  Zebrafish     6.5    2.1
2  Killifish     9.0    4.5
0  Angelfish    15.2    7.7


In [None]:
# Filtering rows by column value
print(df[df.Width >= 3])

        Fish  Length  Width
0  Angelfish    15.2    7.7
2  Killifish     9.0    4.5
