# Week 4 - Univariate Analysis, part 2

# 1. Lesson - None

# 2. Weekly graph question

Below are a histogram and a box plot of the same data. A pharmacy is keeping a record of the prices of the drugs that it sells, and an administrator wants to know how much the more expensive drugs tend to cost, in the context of the other prices.

Please write a short explanation of the pros and cons of these two representations. Which would you choose? How would you modify the formatting, if at all, to make it more visually interesting, clear, or informative?

In [None]:
import numpy as np
import pandas as pd

np.random.seed(0)
num_data = 100
data = np.exp(np.random.uniform(size = num_data) * 4)
df = pd.DataFrame(data.T, columns = ["data"])

In [None]:
# Select the column so a single number is printed instead of a whole Series
print("The 75th percentile is:", df["data"].quantile(q = 0.75))
df.plot.hist()

In [None]:
df.plot.box()

I don't find either plot particularly engaging for answering the original question. Both lack context: neither has a title or axis labels, and neither clearly highlights the high-priced drugs. I find the histogram slightly better, because it clearly shows the long tail of expensive drugs and the high frequency of drugs under $15. The box plot does show the outliers, but they overlap, and it is difficult to see how the prices are distributed across the population.
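One way to add the missing context is sketched below. It is only an illustration, not part of the original assignment: the title, axis labels, bin count, colors, and the choice of the 75th percentile as the cutoff for "more expensive" are all assumptions. The data are regenerated the same way as in the cell above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Regenerate the same synthetic "prices" as in the cell above.
np.random.seed(0)
prices = pd.Series(np.exp(np.random.uniform(size=100) * 4), name="price")

# Assumed cutoff for "more expensive" drugs: the 75th percentile.
p75 = prices.quantile(0.75)

fig, ax = plt.subplots(figsize=(8, 5))
counts, edges, patches = ax.hist(prices, bins=20, color="lightgray", edgecolor="white")

# Recolor every bin that starts at or above the cutoff to highlight the long tail.
for left_edge, patch in zip(edges[:-1], patches):
    if left_edge >= p75:
        patch.set_facecolor("tab:red")

# Mark the cutoff itself and label everything a reader needs for context.
ax.axvline(p75, color="black", linestyle="--", label=f"75th percentile ({p75:.2f} USD)")
ax.set_title("Distribution of drug prices (synthetic data)")
ax.set_xlabel("Price (USD)")
ax.set_ylabel("Number of drugs")
ax.legend()
```

Highlighting the upper bins keeps the full distribution visible, so the expensive drugs can still be read in the context of the other prices.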

# 3. Homework - working on your datasets

This week, you will do the same types of exercises as last week, but you should use your chosen datasets that someone in your class found last semester. (They likely will not be the particular datasets that you found yourself.)

### Here are some types of analysis you can do:

- Draw histograms and histogram variants (swarm plot, KDE plot, violin plot) for each feature or column.

- Draw grouped histograms.  For instance, if you have tree heights for both maple and oak trees, you could draw histograms for both.

- Draw a bar plot to indicate total counts of each categorical variable in a given column.

- Find means, medians, and modes.
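The list above can be sketched on a small made-up dataset. Everything in this example is hypothetical: the `species` and `height_m` columns, the heights, and the sample sizes are invented for illustration (borrowing the maple/oak example), so your own dataset will need its own column names.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: heights (in meters) for two tree species.
rng = np.random.default_rng(0)
trees = pd.DataFrame({
    "species": ["maple"] * 60 + ["oak"] * 40,
    "height_m": np.concatenate([rng.normal(15, 3, 60), rng.normal(22, 4, 40)]).round(1),
})

# Mean and median per group; the mode is taken as the first most-frequent value.
summary = trees.groupby("species")["height_m"].agg(["mean", "median"])
summary["mode"] = trees.groupby("species")["height_m"].agg(lambda s: s.mode().iloc[0])
print(summary)

# Total count of each category, which is what the bar plot displays.
counts = trees["species"].value_counts()

fig, (ax_hist, ax_bar) = plt.subplots(1, 2, figsize=(12, 5))

# Grouped histogram: one semi-transparent histogram per species on shared bins.
bins = np.linspace(trees["height_m"].min(), trees["height_m"].max(), 16)
for species, group in trees.groupby("species"):
    ax_hist.hist(group["height_m"], bins=bins, alpha=0.6, label=species)
ax_hist.set_title("Tree height by species")
ax_hist.set_xlabel("Height (m)")
ax_hist.set_ylabel("Frequency")
ax_hist.legend()

# Bar plot of category counts.
counts.plot.bar(ax=ax_bar)
ax_bar.set_title("Number of trees per species")
ax_bar.set_ylabel("Count")
```

Using the same bin edges for both groups keeps the two histograms directly comparable.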

### Conclusions:

- Explain what conclusions you would draw from this analysis: Are the data what you expect?  Are the data likely to be usable?  If they are not usable, find some new data!

- What is the overall shape of the distribution?  Is it normal, skewed, bimodal, uniform, etc.?

- Are there any outliers present?  (Data points that are far from the others.)

- If there are multiple related histograms, how does the distribution change across different groups?

- What are the minimum and maximum values represented in each histogram?

- How do bin sizes affect the histogram?  Does changing the bin width reveal different patterns in the data?

- Does the distribution appear normal, or does it have a different distribution?
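Several of the questions above can be checked numerically as well as visually. The sketch below is one possible approach, not a required method: the right-skewed sample is synthetic, the 1.5 × IQR rule is just one common convention for flagging outliers, and the bin counts are arbitrary choices.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for one numeric column.
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=2.0, sigma=0.8, size=500), name="value")

# Shape: positive skewness suggests a long right tail; mean > median agrees.
print(f"skewness: {values.skew():.2f}")
print(f"mean: {values.mean():.2f}, median: {values.median():.2f}")
print(f"min: {values.min():.2f}, max: {values.max():.2f}")

# Outliers under the 1.5 * IQR convention (the rule a default box plot uses).
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(f"{len(outliers)} points flagged as outliers")

# Bin sizes: the same data summarized with few, moderate, and many bins.
for bins in (5, 20, 80):
    counts, edges = np.histogram(values, bins=bins)
    width = edges[1] - edges[0]
    print(f"{bins:>2} bins: width {width:.2f}, tallest bin holds {counts.max()} points")
```

With very few bins the tail collapses into one or two bars, while very many bins produce a noisy outline, so it is worth trying several settings before describing the shape.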

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats

In [None]:
'''ALL MY FUNCTIONS'''

# Function to calculate summary stats for one column of a DataFrame.
# The DataFrame is passed in explicitly instead of being read from a global variable.
def calculate_statistics(df, column):
    series = df[column]
    modes = series.mode()
    minimum = series.min()
    maximum = series.max()

    return {
        'mean': series.mean(),
        'median': series.median(),
        # mode() can return several values; report the first one, or NaN if there is none
        'mode': modes.iloc[0] if not modes.empty else np.nan,
        'std_dev': series.std(),
        'variance': series.var(),
        'min': minimum,
        'max': maximum,
        'range': maximum - minimum
    }


def generate_plots(df):
    # Iterate over each numeric column
    for column in df.select_dtypes(include=[np.number]).columns:
        # Named `summary` so it does not shadow the imported scipy `stats` module
        summary = calculate_statistics(df, column)
        print(f"Statistics for {column}:")
        print(f"Mean: {summary['mean']:.1f}")
        print(f"Median: {summary['median']:.1f}")
        print(f"Mode: {summary['mode']:.1f}")
        print(f"Standard Deviation: {summary['std_dev']:.1f}")
        print(f"Variance: {summary['variance']:.1f}")
        print(f"Minimum: {summary['min']:.1f}")
        print(f"Maximum: {summary['max']:.1f}")
        print(f"Range: {summary['range']:.1f}\n")

        
        # Histogram
        plt.figure(figsize=(10, 6))
        sns.histplot(df[column], kde=False)
        plt.title(f'Histogram of {column}')
        plt.xlabel(column)
        plt.ylabel('Frequency')
        plt.show()
        
        # KDE plot
        plt.figure(figsize=(10, 6))
        sns.kdeplot(df[column], fill=True)  # `fill` replaces the deprecated `shade` argument
        plt.title(f'KDE Plot of {column}')
        plt.xlabel(column)
        plt.ylabel('Density')
        plt.show()
        
        # Violin plot
        plt.figure(figsize=(10, 6))
        sns.violinplot(x=df[column])
        plt.title(f'Violin Plot of {column}')
        plt.xlabel(column)
        plt.show()
        
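        # Swarm plot (drop missing values and cap the sample at 500 points,
        # so columns with fewer than 500 rows do not raise an error)
        sample = df[column].dropna()
        sample = sample.sample(n=min(500, len(sample)), random_state=0)
        sns.swarmplot(x=sample)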
        plt.title(f'Swarm Plot of {column}')
        plt.xlabel(column)
        plt.show()





In [None]:
'''PRICELINE ANALYSIS'''
df = pd.read_csv("priceline_clean.csv")
generate_plots(df)

In [None]:
'''DELAY DATA ANALYSIS'''
df = pd.read_csv("delay_data_clean.csv")
generate_plots(df)

In [None]:
'''USDOT ON-TIME DATA ANALYSIS'''
df = pd.read_csv("usdot_onetime_clean.csv")
generate_plots(df)

# 4. Storytelling With Data graph

Reproduce any graph of your choice from pp. 52-68 of the Storytelling With Data book (the second half of chapter two) as best you can.  You do not have to get the exact data values right, just the overall look and feel.