https://www.unofficialgoogledatascience.com/2016/10/practical-advice-for-analysis-of-large.html


Although the mean and standard deviation may suggest a unimodal, symmetric distribution, a histogram reveals two distinct peaks, which is critical information that summary statistics alone miss.

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Generate a bimodal dataset
data = np.concatenate([np.random.normal(5, 1, 1000), np.random.normal(15, 1, 1000)])
df = pd.DataFrame({'value': data})

# Summary statistics
print(df.describe())

# Histogram
plt.figure(figsize=(8, 4))
sns.histplot(df['value'], kde=True, bins=30)
plt.title("Histogram with KDE - Bimodal Distribution")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.grid(True)
plt.show()
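
To make the point numerically as well as visually, the sketch below builds a unimodal sample with the same mean and standard deviation as the bimodal one and compares how much of each sample sits near the mean. The seed, the sample sizes, and the "within 1 unit of the mean" cutoff are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Bimodal sample: two well-separated clusters, as in the cell above
bimodal = np.concatenate([rng.normal(5, 1, 1000), rng.normal(15, 1, 1000)])

# Unimodal sample drawn to share the bimodal sample's mean and std
unimodal = rng.normal(bimodal.mean(), bimodal.std(), 2000)

shares = {}
for name, sample in [("bimodal", bimodal), ("unimodal", unimodal)]:
    # Fraction of points lying within 1 unit of the sample mean
    shares[name] = np.mean(np.abs(sample - sample.mean()) < 1.0)
    print(f"{name}: mean={sample.mean():.2f}, std={sample.std():.2f}, "
          f"share within 1 of mean={shares[name]:.3f}")
```

The two samples report nearly identical means and standard deviations, yet almost none of the bimodal points lie near the mean. That gap is what the histogram shows at a glance.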

This boxplot helps identify extreme values. Rather than immediately removing them, review the data source or context—outliers may signal important anomalies like sensor errors or exceptional behavior.

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Simulate a dataset with outliers
data = np.append(np.random.normal(50, 10, 100), [5, 150])
df = pd.DataFrame({'value': data})

# Calculate IQR and determine outliers
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Flag outliers
df['is_outlier'] = (df['value'] < lower_bound) | (df['value'] > upper_bound)
print("Flagged outliers:")
print(df[df['is_outlier']])

# Visualize with boxplot
plt.figure(figsize=(6, 4))
sns.boxplot(x=df['value'])
plt.title("Boxplot with Outliers Highlighted")
plt.grid(True)
plt.show()
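
Before deciding what to do with flagged points, it can help to see how much they actually move the summary statistics. The sketch below is one way to do that, assuming the same simulated data shape and the same 1.5 × IQR rule as above; the seed and the `comparison` table layout are illustrative choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # fixed seed for reproducibility
values = np.append(rng.normal(50, 10, 100), [5, 150])
df = pd.DataFrame({"value": values})

# Same 1.5 * IQR rule as in the cell above
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
df["is_outlier"] = (df["value"] < q1 - 1.5 * iqr) | (df["value"] > q3 + 1.5 * iqr)

# Compare mean and median with and without the flagged rows
kept = df.loc[~df["is_outlier"], "value"]
comparison = pd.DataFrame(
    {
        "all_rows": [df["value"].mean(), df["value"].median()],
        "without_flagged": [kept.mean(), kept.median()],
    },
    index=["mean", "median"],
)
print(comparison)
```

If the median barely changes while the mean shifts, reporting a robust statistic may be a reasonable alternative to deleting rows.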

This example illustrates how to report not only a point estimate (the mean) but also the uncertainty around it—helping stakeholders understand the precision of your analysis.

In [0]:
import numpy as np
import scipy.stats as stats

# Sample data
np.random.seed(42)
data = np.random.normal(loc=100, scale=15, size=200)

# Sample mean and standard error
mean = np.mean(data)
sem = stats.sem(data)

# 95% confidence interval
confidence = 0.95
margin_of_error = sem * stats.t.ppf((1 + confidence) / 2., len(data)-1)
ci_lower = mean - margin_of_error
ci_upper = mean + margin_of_error

print(f"Mean: {mean:.2f}")
print(f"95% Confidence Interval: ({ci_lower:.2f}, {ci_upper:.2f})")
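
The t-based interval above assumes the sample mean is approximately normally distributed. As a cross-check that makes fewer assumptions, a percentile bootstrap can be sketched as follows. This is a swapped-in technique, not part of the original example, and the number of resamples (5000) and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=100, scale=15, size=200)

# Resample the data with replacement and record the mean of each resample
n_resamples = 5000
idx = rng.integers(0, len(data), size=(n_resamples, len(data)))
boot_means = data[idx].mean(axis=1)

# Percentile bootstrap: take the middle 95% of the resampled means
boot_lower, boot_upper = np.percentile(boot_means, [2.5, 97.5])

print(f"Mean: {data.mean():.2f}")
print(f"95% bootstrap interval: ({boot_lower:.2f}, {boot_upper:.2f})")
```

For well-behaved data like this, the bootstrap interval should land close to the t-based one. A large disagreement between the two is a hint that the data are skewed or heavy-tailed.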

This example highlights how looking directly at raw values can reveal issues like invalid timestamps or incorrect data types that could silently break your downstream logic if unchecked.

In [0]:
import pandas as pd

# Load sample data (simulate log data)
data = {
    'timestamp': ['2024-01-01 12:00:00', '2024-01-01 12:01:00', 'not_a_timestamp'],
    'event': ['login', 'click', None],
    'user_id': [123, 124, 'abc']
}
df = pd.DataFrame(data)

# Display raw data
print("Sample Raw Data:")
print(df.head())

# Check for parsing issues
print("Data Types:")
print(df.dtypes)

# Attempt to convert timestamp
df['timestamp_converted'] = pd.to_datetime(df['timestamp'], errors='coerce')
print("With Converted Timestamps:")
print(df[['timestamp', 'timestamp_converted']])
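
The same coercion idea extends to the other columns. The sketch below, which assumes `user_id` is meant to be numeric and `event` is required, collects one boolean check per column so that problem rows can be counted and inspected together. The `checks` table and its column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": ["2024-01-01 12:00:00", "2024-01-01 12:01:00", "not_a_timestamp"],
    "event": ["login", "click", None],
    "user_id": [123, 124, "abc"],
})

# One boolean column per check; True marks a problem in that row
checks = pd.DataFrame({
    "bad_timestamp": pd.to_datetime(df["timestamp"], errors="coerce").isna(),
    "bad_user_id": pd.to_numeric(df["user_id"], errors="coerce").isna(),
    "missing_event": df["event"].isna(),
})

print("Problem counts per check:")
print(checks.sum())

print("Rows failing at least one check:")
print(df[checks.any(axis=1)])
```

Here every check flags the same third row, which suggests a single corrupt record rather than three unrelated problems.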

This simple segmentation reveals differences in user behavior between mobile and desktop platforms—insights that might be hidden in aggregate statistics.

In [0]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Simulated user engagement data
data = {
    'device_type': ['mobile'] * 50 + ['desktop'] * 50,
    'session_duration': list(np.random.normal(5, 1.5, 50)) + list(np.random.normal(8, 2, 50))
}
df = pd.DataFrame(data)

# Compare distributions by device type
plt.figure(figsize=(8, 4))
sns.boxplot(x='device_type', y='session_duration', data=df)
plt.title("Session Duration by Device Type")
plt.ylabel("Minutes")
plt.xlabel("Device Type")
plt.grid(True)
plt.show()
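
A numeric companion to the boxplot is to put the aggregate statistics next to the per-segment ones. The sketch below assumes the same simulated distributions as the cell above, with a fixed seed added for reproducibility.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "device_type": ["mobile"] * 50 + ["desktop"] * 50,
    "session_duration": np.concatenate(
        [rng.normal(5, 1.5, 50), rng.normal(8, 2, 50)]
    ),
})

# Aggregate view: one number summarising everyone
overall = df["session_duration"].agg(["mean", "median", "std"])
print("Overall:")
print(overall)

# Segmented view: the same statistics per device type
by_device = df.groupby("device_type")["session_duration"].agg(
    ["count", "mean", "median", "std"]
)
print("By device type:")
print(by_device)
```

The overall mean falls between the two segment means and describes neither group well, which is the pattern the aggregate hides.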

**Objective:**
To verify whether all expected data is being correctly logged—this is crucial before relying on the dataset for analysis.

**Step-by-step Breakdown:**
**1. Simulate Log Data**

    log_data = pd.DataFrame({
        'event_type': ['click', 'view', 'click', 'hover', 'scroll'],
        'timestamp': pd.date_range("2024-01-01", periods=5, freq='min'),
        'element_id': ['btn-1', 'img-2', 'btn-2', None, 'div-3']
    })

Creates a small DataFrame mimicking logging events on a website. Each row represents a user interaction with metadata:

- event_type: the type of user action
- timestamp: when it occurred
- element_id: the webpage element interacted with

**2. Check for Missing Data**

    missing_elements = log_data[log_data['element_id'].isnull()]

This filters rows where element_id is None (i.e., not recorded). These might indicate a bug in how the frontend tags or logs interactions.

**3. Validate Event Coverage**

    expected_events = {'click', 'view', 'scroll', 'hover'}
    logged_events = set(log_data['event_type'].unique())
    missing_events = expected_events - logged_events

Here, you define the events you expect to find (expected_events) and compare that with what’s actually in the log (logged_events). Any difference means something might be wrong (e.g., a misconfigured tracker not logging one type of event).

**Why It Matters:**
If element_id is missing, you can’t trace user actions accurately.

If some events like hover or scroll are never logged, analyses on those behaviors will be misleading or incomplete.

This kind of basic validation is easy to overlook but critical for trustworthy insights.

In [0]:
import pandas as pd

# Simulated log data
log_data = pd.DataFrame({
    'event_type': ['click', 'view', 'click', 'hover', 'scroll'],
    'timestamp': pd.date_range("2024-01-01", periods=5, freq='min'),
    'element_id': ['btn-1', 'img-2', 'btn-2', None, 'div-3']
})

# Check for missing critical data
missing_elements = log_data[log_data['element_id'].isnull()]
print("Missing element IDs:")
print(missing_elements)

# Check that all expected event types are being logged
expected_events = {'click', 'view', 'scroll', 'hover'}
logged_events = set(log_data['event_type'].unique())
missing_events = expected_events - logged_events
print("Missing expected event types:", missing_events)
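
In the sample data above all four expected event types are present, so the coverage check prints an empty set. The sketch below wraps the same checks in a small helper and runs it on a log where `scroll` is never recorded, to show what a failure looks like. The function name `validate_log` and the shape of its report are hypothetical, not part of any library.

```python
import pandas as pd

def validate_log(log, expected_events, required_columns):
    """Return a small report of logging gaps (hypothetical helper)."""
    logged = set(log["event_type"].dropna().unique())
    return {
        # Expected event types that never appear in the log
        "missing_event_types": set(expected_events) - logged,
        # Event types in the log that nobody declared as expected
        "unexpected_event_types": logged - set(expected_events),
        # Number of null values in each required column
        "null_counts": log[required_columns].isnull().sum().to_dict(),
    }

# Simulated log in which 'scroll' events are never recorded
log_data = pd.DataFrame({
    "event_type": ["click", "view", "click", "hover"],
    "timestamp": pd.date_range("2024-01-01", periods=4, freq="min"),
    "element_id": ["btn-1", "img-2", None, "div-3"],
})

report = validate_log(
    log_data,
    expected_events={"click", "view", "scroll", "hover"},
    required_columns=["event_type", "timestamp", "element_id"],
)
for key, value in report.items():
    print(f"{key}: {value}")
```

Reporting unexpected event types alongside missing ones can also catch renamed or misspelled events, which often explain why an expected type went missing.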

**Example**:
Suppose you're building a dashboard to monitor customer churn. Instead of perfecting data cleaning first, build a quick prototype that connects raw data to key metrics.

The prototype in the cell below flags two customers as likely to churn: those whose last login was more than 30 days ago. Based on stakeholder feedback, you can refine the threshold, add visualizations, and later optimize performance or automate the pipeline.

In [0]:
import pandas as pd

# Simulated data
raw_data = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'is_active': [1, 0, 1, 1, 0],
    'last_login_days_ago': [5, 60, 15, 3, 90]
})

# Prototype churn rule
raw_data['likely_churn'] = raw_data['last_login_days_ago'] > 30

# Simple summary
summary = raw_data.groupby('likely_churn')['customer_id'].count()
print("Churn Summary:")
print(summary)
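
One cheap way to refine the threshold with stakeholders is to show how the flagged count changes as the cutoff moves. The sketch below reuses the simulated customers; the candidate thresholds (14, 30, and 60 days) are hypothetical values chosen for illustration.

```python
import pandas as pd

raw_data = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "is_active": [1, 0, 1, 1, 0],
    "last_login_days_ago": [5, 60, 15, 3, 90],
})

# Hypothetical candidate cutoffs, in days since last login
thresholds = [14, 30, 60]

# Count how many customers each cutoff would flag as likely to churn
flagged = {
    t: int((raw_data["last_login_days_ago"] > t).sum()) for t in thresholds
}
for t, n in flagged.items():
    print(f"threshold > {t} days: {n} customers flagged")
```

With these five customers the flagged count drops from 3 to 2 to 1 as the cutoff rises, so the choice of threshold visibly changes the headline number and is worth agreeing on early.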

**Example**:
Imagine you run an A/B test on a website feature and find a statistically significant improvement in user engagement.

The result may be statistically significant (e.g., p < 0.05), but a 0.5-minute increase in average session duration may not justify the development cost or user experience risk. Weigh statistical significance against business impact before acting on it.

In [0]:
import numpy as np
from scipy import stats

# Simulate engagement durations (in minutes)
a = np.random.normal(10.0, 2.0, 1000)  # group A (control)
b = np.random.normal(10.5, 2.0, 1000)  # group B (treatment)

# t-test for significance
t_stat, p_val = stats.ttest_ind(a, b)
print(f"p-value: {p_val:.4f}")
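
A p-value alone says nothing about how large the effect is. The sketch below reports the size of the difference next to it: the raw difference in means, an approximate 95% confidence interval, and Cohen's d. It assumes large, equal-sized groups (so a normal approximation and a simple pooled standard deviation are reasonable) and uses a fixed seed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(10.0, 2.0, 1000)  # group A (control)
b = rng.normal(10.5, 2.0, 1000)  # group B (treatment)

# Raw effect: difference in mean session duration (minutes)
diff = b.mean() - a.mean()

# Standard error of the difference, allowing unequal variances
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

# Approximate 95% CI using the normal quantile (large-sample assumption)
z = stats.norm.ppf(0.975)
ci = (diff - z * se, diff + z * se)

# Cohen's d with a pooled SD (simple average of variances; equal group sizes)
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
cohens_d = diff / pooled_sd

print(f"Difference in means: {diff:.2f} min")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"Cohen's d: {cohens_d:.2f}")
```

Reporting the interval in the metric's own units (minutes) lets stakeholders compare the plausible gain directly against the cost of shipping the feature.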

This example shows how a subtle but meaningful shift in a key metric can be easily spotted with time series monitoring. It's a valuable technique for catching unintended side effects after deployments or system changes.

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Simulate time series data
dates = pd.date_range(start='2024-01-01', periods=100, freq='D')
data = np.random.normal(loc=100, scale=5, size=100)
data[60:] += 10  # simulate a sudden shift
df = pd.DataFrame({'date': dates, 'metric': data})

# Plot time series
plt.figure(figsize=(10, 4))
plt.plot(df['date'], df['metric'], marker='o', linestyle='-')
plt.axvline(df['date'].iloc[60], color='red', linestyle='--', label='Change Point')
plt.title("Metric Over Time with Simulated Shift")
plt.xlabel("Date")
plt.ylabel("Metric Value")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
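
Eyeballing a chart works for a one-off review, but ongoing monitoring usually needs a rule. The sketch below shows one simple option, not a definitive detector: compare a rolling mean against a threshold derived from a baseline period. The 30-day baseline, the 14-day window, and the 3-standard-error cutoff are all assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
dates = pd.date_range(start="2024-01-01", periods=100, freq="D")
metric = rng.normal(loc=100, scale=5, size=100)
metric[60:] += 10  # simulate a sudden shift, as in the cell above
series = pd.Series(metric, index=dates)

window = 14                   # assumed rolling window, in days
baseline = series.iloc[:30]   # assumed known-good reference period

# Rolling mean over the last `window` days (NaN until the window fills)
rolling_mean = series.rolling(window).mean()

# Alert when the rolling mean exceeds the baseline mean by more than
# 3 standard errors of a `window`-day average
threshold = baseline.mean() + 3 * baseline.std() / np.sqrt(window)
alerts = rolling_mean[rolling_mean > threshold]
first_alert = alerts.index[0] if not alerts.empty else None

print(f"Alert threshold: {threshold:.2f}")
print(f"First alert date: {first_alert}")
print(f"Mean before shift: {series.iloc[:60].mean():.2f}")
print(f"Mean after shift:  {series.iloc[60:].mean():.2f}")
```

A shorter window reacts faster but raises more false alarms, and this rule only catches upward shifts. A two-sided check would compare the absolute deviation from the baseline instead.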