
##Analyzing tables using Pandas

Data tables are commonly used to store information.

The formatting of a table isn’t important; what matters is its structure. All tables
share the same structural features: every table contains horizontal rows and vertical
columns, and quite often the columns are labeled with explicit names in a header row.

##Setup

In [4]:
from collections import defaultdict
import itertools

import numpy as np
import pandas as pd
from scipy import stats
import math

import matplotlib.pyplot as plt

In [None]:
# Helper: squared distance between two values
def squared_distance(value1, value2):
  return (value1 - value2) ** 2

In [None]:
# Helper: total sum of squared distances between a value and all measurements
def sum_of_squared_distances(value, measurements):
  return sum(squared_distance(value, m) for m in measurements)

In [None]:
# Helper: sum of squared distances between each data point and the mean
def sum_of_squares(data):
  mean = np.mean(data)
  return sum(squared_distance(value, mean) for value in data)

In [None]:
# Computing the variance from mean squared distance
def variance(data):
  mean = np.mean(data)
  return np.mean([squared_distance(value, mean) for value in data])

In [None]:
# Computing the weighted variance using np.average
def weighted_variance(data, weights):
  mean = np.average(data, weights=weights)
  squared_distances = [squared_distance(value, mean) for value in data]
  return np.average(squared_distances, weights=weights)

# Compare floats with a tolerance rather than exact equality
assert np.isclose(weighted_variance([75, 77], [9, 1]), np.var(9 * [75] + [77]))

##Storing tables using basic Python

Let’s define a sample table in Python. The table stores measurements for various species
of fish, in centimeters. 

Our measurement table contains three columns: Fish,
Length, and Width.

In [2]:
# Storing a table using Python data structures
fish_measures = {
    "Fish": ["Angelfish", "Zebrafish", "Killifish", "Swordtail"],
    "Length": [15.2, 6.5, 9, 6],
    "Width": [7.7, 2.1, 4.5, 2]
}

In [3]:
# Accessing table columns using a dictionary
zebrafish_index = fish_measures["Fish"].index("Zebrafish")
zebrafish_length = fish_measures["Length"][zebrafish_index]

print(f"The length of a zebrafish is {zebrafish_length:.2f} cm")

The length of a zebrafish is 6.50 cm
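
The same index-then-lookup pattern is needed for every column when we want a whole row. The sketch below illustrates this with a hypothetical helper, `get_row`, which is not part of the original notebook; it assumes the table is a dictionary of equal-length lists, as defined above.

```python
# Same dictionary-of-lists table as in the cell above
fish_measures = {
    "Fish": ["Angelfish", "Zebrafish", "Killifish", "Swordtail"],
    "Length": [15.2, 6.5, 9, 6],
    "Width": [7.7, 2.1, 4.5, 2],
}

def get_row(table, key_column, key):
    """Return one row of a dictionary-of-lists table as a dict (hypothetical helper)."""
    # Find the row position in the key column; raises ValueError if key is absent
    index = table[key_column].index(key)
    # Use that position to pull the matching entry out of every column
    return {column: values[index] for column, values in table.items()}

zebrafish_row = get_row(fish_measures, "Fish", "Zebrafish")
print(zebrafish_row)
```

Every row retrieval requires a linear search followed by one list access per column, and nothing enforces that the lists stay the same length.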


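This approach works, but it is cumbersome: we must first find the row index in one list
and then use that index to look up a value in another. A better solution
is provided by the Pandas library, which is designed for table manipulation.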

##Exploring tables using Pandas

In [5]:
# Loading a table into Pandas
df = pd.DataFrame(fish_measures)
print(df)

        Fish  Length  Width
0  Angelfish    15.2    7.7
1  Zebrafish     6.5    2.1
2  Killifish     9.0    4.5
3  Swordtail     6.0    2.0


In [6]:
# Accessing the first two rows of a table
print(df.head(2))

        Fish  Length  Width
0  Angelfish    15.2    7.7
1  Zebrafish     6.5    2.1


In [7]:
# Summarizing the numeric columns
print(df.describe())

          Length     Width
count   4.000000  4.000000
mean    9.175000  4.075000
std     4.225616  2.678775
min     6.000000  2.000000
25%     6.375000  2.075000
50%     7.750000  3.300000
75%    10.550000  5.300000
max    15.200000  7.700000
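
One detail worth checking in the summary above is the `std` row. As far as the Pandas defaults go, `std` is the sample standard deviation (`ddof=1`), whereas `np.std` defaults to the population version (`ddof=0`). The following is a minimal sketch that rebuilds only the Length column to compare the two; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

lengths = [15.2, 6.5, 9, 6]
df_lengths = pd.DataFrame({"Length": lengths})
summary = df_lengths.describe()

pandas_std = summary.loc["std", "Length"]   # value reported by describe()
sample_std = np.std(lengths, ddof=1)        # divides by n - 1
population_std = np.std(lengths)            # divides by n (NumPy default)

print(f"describe() std: {pandas_std:.6f}")
print(f"np.std ddof=1:  {sample_std:.6f}")
print(f"np.std ddof=0:  {population_std:.6f}")

# The 50% row is simply the median of the column
assert np.isclose(summary.loc["50%", "Length"], np.median(lengths))
```

The first two printed values agree, while the third is smaller. Keep this in mind when comparing Pandas summaries with results from `np.std` or `np.var`.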


In [16]:
# Computing the column mean
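# numeric_only=True skips the string column; recent Pandas versions raise an error without it
print(df.mean(numeric_only=True))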

Length    9.175
Width     4.075
dtype: float64


In [13]:
# Summarizing the string columns
print(df.describe(include=[object]))

             Fish
count           4
unique          4
top     Angelfish
freq            1


In [17]:
# Retrieving the table as a 2D NumPy array
print(df.values)

[['Angelfish' 15.2 7.7]
 ['Zebrafish' 6.5 2.1]
 ['Killifish' 9.0 4.5]
 ['Swordtail' 6.0 2.0]]


In [18]:
assert isinstance(df.values, np.ndarray)
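
Because the table mixes strings and numbers, the array printed above has the generic `object` dtype, which rules out fast numeric operations. A minimal sketch of one way around this is to select only the numeric columns before converting. It uses `to_numpy()`, the method the Pandas documentation recommends over `.values`, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Fish": ["Angelfish", "Zebrafish", "Killifish", "Swordtail"],
    "Length": [15.2, 6.5, 9, 6],
    "Width": [7.7, 2.1, 4.5, 2],
})

mixed = df.to_numpy()                           # strings and floats together
numeric = df[["Length", "Width"]].to_numpy()    # numeric columns only

print(mixed.dtype)     # object
print(numeric.dtype)   # float64

# With a float array, vectorized NumPy operations work as expected
column_means = numeric.mean(axis=0)
print(column_means)
```

The column means computed from the numeric array match the values returned by the column-mean cell earlier in this section.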

##Retrieving table columns