# 📊 Homework 1: Introduction to Data Processing
**MIS 769 - Big Data Analytics for Business | Spring 2026**

**Points:** 20 | **Due:** Sunday, February 2, 2026 @ 11pm Pacific

**Author:** Richard Young, Ph.D. | UNLV Lee Business School

**Compute:** CPU (free tier)

---

## What You'll Learn

1. Connect Google Colab to external data sources (Kaggle/HuggingFace)
2. Load and explore large datasets with pandas
3. Perform **Data Quality Assessment** (missing values, duplicates, outliers)
4. **Find Something Interesting** in your data

---

## The Data Science Pipeline

The diagram below illustrates the standard **data processing pipeline** you'll work through in this homework:

*[Figure: data_pipeline.svg, the data processing pipeline]*

**Key Pipeline Stages:**
1. **Acquisition** - Loading data from HuggingFace/Kaggle (Part 2)
2. **Preprocessing** - Cleaning: missing values, duplicates, outliers (Part 3)
3. **Feature Extraction** - Understanding your data's structure (Parts 3-4)
4. **Modeling** - Finding patterns in your data (Part 4)
5. **Interpretation** - Drawing meaningful conclusions

This same pipeline applies whether you're analyzing business data, neural signals, or training ML models.

---

## Part 1: Environment Setup (3 points)

First, let's install the libraries we need and verify everything works.

In [None]:
# Install required packages
!pip install datasets pandas numpy matplotlib seaborn -q

print("✅ Packages installed successfully!")

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_colwidth', 100)

print("✅ Libraries imported!")
print(f"   Pandas version: {pd.__version__}")
print(f"   NumPy version: {np.__version__}")

---

## Part 2: Load Your Data (4 points)

Choose **ONE** of the following data sources. HuggingFace is recommended for beginners (no login required).

### Option A: HuggingFace Datasets (Easiest - No Login Required)

In [None]:
# OPTION A: Load from HuggingFace (RECOMMENDED)
from datasets import load_dataset

# Choose ONE dataset: leave exactly one load_dataset line uncommented
# and keep the others commented out.

# NVIDIA HelpSteer2 - AI Response Quality Ratings (~21k rows)
dataset = load_dataset("nvidia/HelpSteer2", split="train")

# OR: Movie Reviews - IMDB
# dataset = load_dataset("stanfordnlp/imdb", split="train")

# OR: Yelp Reviews
# dataset = load_dataset("Yelp/yelp_review_full", split="train[:50000]")

# Convert to pandas DataFrame
df = dataset.to_pandas()

print(f"✅ Loaded {len(df):,} records from HuggingFace")
print(f"   Columns: {list(df.columns)}")

### Option B: Kaggle Datasets (More Variety - Requires API Key)

If you want to use Kaggle, you'll need to:
1. Create a Kaggle account at kaggle.com
2. Go to Settings → API → Create New Token
3. Upload the `kaggle.json` file when prompted below

In [None]:
# OPTION B: Load from Kaggle (uncomment to use)

# # Step 1: Set up Kaggle credentials
# !pip install kaggle -q
# from google.colab import files
# print("Upload your kaggle.json file:")
# files.upload()  # Upload kaggle.json

# !mkdir -p ~/.kaggle && cp kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

# # Step 2: Download a dataset (choose one)
# # Spotify Tracks:
# !kaggle datasets download -d maharshipandya/-spotify-tracks-dataset -p /content --unzip
# df = pd.read_csv('/content/dataset.csv')

# # OR: Netflix Titles:
# # !kaggle datasets download -d shivamb/netflix-shows -p /content --unzip
# # df = pd.read_csv('/content/netflix_titles.csv')

# print(f"✅ Loaded {len(df):,} records from Kaggle")

### Verify Your Data

Let's take a first look at the data.

In [None]:
# Basic info about your dataset
print("=" * 60)
print("DATASET OVERVIEW")
print("=" * 60)
print(f"Number of rows: {len(df):,}")
print(f"Number of columns: {len(df.columns)}")
print(f"\nColumn names: {list(df.columns)}")
print(f"\nData types:")
print(df.dtypes)

In [None]:
# Preview first few rows
df.head()

---

## Part 3: Data Quality Assessment (8 points)

A critical skill for any data professional is assessing data quality BEFORE analysis. Let's check for common issues.

### 3.1 Missing Values

In [None]:
# Check for missing values
print("=" * 60)
print("MISSING VALUES ANALYSIS")
print("=" * 60)

missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)

missing_df = pd.DataFrame({
    'Column': missing.index,
    'Missing Count': missing.values,
    'Missing %': missing_pct.values
})
missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing %', ascending=False)

if len(missing_df) > 0:
    print("\n⚠️ Columns with missing values:")
    print(missing_df.to_string(index=False))
else:
    print("\n✅ No missing values found!")

print(f"\nTotal cells: {df.size:,}")
print(f"Missing cells: {df.isnull().sum().sum():,}")
print(f"Completeness: {(1 - df.isnull().sum().sum() / df.size) * 100:.2f}%")
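
To make the three numbers above concrete, here is a minimal sketch on a tiny hand-made DataFrame. The `toy` frame and its column names are hypothetical and are not part of any course dataset; the calculation mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 rows x 3 columns, with 3 missing cells in total
toy = pd.DataFrame({
    "price": [10.0, np.nan, 12.5, np.nan],
    "city": ["Las Vegas", "Reno", None, "Henderson"],
    "units": [1, 2, 3, 4],
})

# Per-column missing counts and percentages
toy_missing = toy.isnull().sum()
toy_missing_pct = (toy_missing / len(toy) * 100).round(2)

# Overall completeness: share of all cells that are not missing
toy_completeness = (1 - toy_missing.sum() / toy.size) * 100

print(toy_missing.to_dict())
print(toy_missing_pct.to_dict())
print(f"Completeness: {toy_completeness:.2f}%")
```

In this toy frame, `price` is missing in 2 of 4 rows (50%), `city` in 1 of 4 (25%), and 9 of the 12 cells are filled, so completeness is 75%.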

### 3.2 Duplicate Records

In [None]:
# Check for duplicate rows
print("=" * 60)
print("DUPLICATE RECORDS ANALYSIS")
print("=" * 60)

duplicates = df.duplicated().sum()
duplicate_pct = (duplicates / len(df) * 100)

print(f"\nTotal rows: {len(df):,}")
print(f"Duplicate rows: {duplicates:,}")
print(f"Duplicate percentage: {duplicate_pct:.2f}%")

if duplicates > 0:
    print("\n⚠️ Sample duplicate rows:")
    print(df[df.duplicated(keep=False)].head())
else:
    print("\n✅ No duplicate rows found!")
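
The cell above counts duplicates with the default `keep='first'` but displays them with `keep=False`, so the two can disagree. The sketch below shows the difference on a hypothetical toy frame (the `toy` data is made up for illustration); `drop_duplicates()` is included only as one possible follow-up step, not a required part of the assignment.

```python
import pandas as pd

# Hypothetical toy data: rows 0, 2 and 3 are identical
toy = pd.DataFrame({
    "user": ["ann", "bob", "ann", "ann"],
    "rating": [5, 3, 5, 5],
})

# Default keep='first': the first occurrence is NOT flagged,
# so this counts only the extra copies
extra_copies = toy.duplicated().sum()

# keep=False: every row that belongs to a duplicate group is flagged
rows_in_groups = toy.duplicated(keep=False).sum()

# drop_duplicates() keeps one row per group of identical rows
deduped = toy.drop_duplicates()

print(extra_copies, rows_in_groups, len(deduped))
```

Here two rows are counted as duplicates, three rows would appear in the sample display, and two rows remain after de-duplication.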

### 3.3 Outliers (for numeric columns)

In [None]:
# Identify numeric columns and check for outliers
print("=" * 60)
print("OUTLIER ANALYSIS (Numeric Columns)")
print("=" * 60)

numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print(f"\nNumeric columns found: {numeric_cols}")

if len(numeric_cols) > 0:
    for col in numeric_cols[:5]:  # Limit to first 5 numeric columns
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        
        outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
        outlier_pct = len(outliers) / len(df) * 100
        
        print(f"\n📊 {col}:")
        print(f"   Range: {df[col].min():.2f} to {df[col].max():.2f}")
        print(f"   Mean: {df[col].mean():.2f}, Median: {df[col].median():.2f}")
        print(f"   Outliers: {len(outliers):,} ({outlier_pct:.1f}%)")
else:
    print("\nNo numeric columns found for outlier analysis.")
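
The loop above applies the 1.5 × IQR rule. As a worked example, here is the same arithmetic on five hypothetical values (the `values` series is made up for illustration), using pandas' default linear interpolation for quantiles.

```python
import pandas as pd

# Hypothetical toy values with one obvious extreme value
values = pd.Series([1, 2, 3, 4, 100])

q1 = values.quantile(0.25)      # 2.0
q3 = values.quantile(0.75)      # 4.0
iqr = q3 - q1                   # 2.0
lower_bound = q1 - 1.5 * iqr    # -1.0
upper_bound = q3 + 1.5 * iqr    # 7.0

# Anything outside [lower_bound, upper_bound] is flagged
flagged = values[(values < lower_bound) | (values > upper_bound)]
print(flagged.tolist())
```

Only the value 100 falls outside the range from -1 to 7. A flagged value is a candidate for review, not automatically an error: on bounded scales such as ratings, the rule can flag values that are perfectly legitimate.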

### 3.4 Data Quality Summary Visualization

In [None]:
# Create a visual summary
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Missing values heatmap
ax1 = axes[0]
missing_matrix = df.isnull().sum().values.reshape(1, -1)
sns.heatmap(missing_matrix, annot=True, fmt='d', cmap='YlOrRd', 
            xticklabels=df.columns, yticklabels=['Missing'], ax=ax1, cbar=False)
ax1.set_title('Missing Values by Column', fontsize=12)
ax1.tick_params(axis='x', rotation=45)

# Data types distribution
ax2 = axes[1]
dtype_counts = df.dtypes.astype(str).value_counts()
dtype_counts.plot(kind='bar', ax=ax2, color=['steelblue', 'coral', 'green', 'purple'][:len(dtype_counts)])
ax2.set_title('Column Data Types', fontsize=12)
ax2.set_xlabel('Data Type')
ax2.set_ylabel('Count')
ax2.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print("\n✅ Data Quality Assessment Complete!")

---

## Part 4: Find Something Interesting! (5 points)

Now it's YOUR turn to explore. Find something interesting, surprising, or useful in your data.

**Ideas to explore:**
- What's the distribution of a key variable?
- Are there any unexpected patterns?
- What correlations exist between variables?
- What's the most/least common category?
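
A few of the ideas above can be sketched on a small hand-made frame. Everything in `toy` (column names and values) is hypothetical; swap in your own `df` and column names. This is one possible starting point, not a required approach.

```python
import pandas as pd

# Hypothetical toy data standing in for your own df
toy = pd.DataFrame({
    "category": ["a", "b", "a", "c", "a", "b"],
    "length": [10, 20, 30, 40, 50, 60],
    "score": [1, 2, 3, 4, 5, 6],
})

# Most and least common category
counts = toy["category"].value_counts()
most_common = counts.idxmax()
least_common = counts.idxmin()

# Pairwise Pearson correlations between numeric columns only
corr = toy.corr(numeric_only=True)

# Average of a numeric column within each category
mean_by_category = toy.groupby("category")["score"].mean()

print(counts.to_dict())
print(corr.round(2))
print(mean_by_category.to_dict())
```

A strong correlation or a large gap between group means is a prompt for a follow-up question, such as why two columns move together, rather than a conclusion on its own.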

In [None]:
# YOUR EXPLORATION CODE HERE
# Example: Distribution of a text column's length

# Find text columns
text_cols = df.select_dtypes(include=['object']).columns.tolist()
if len(text_cols) > 0:
    text_col = text_cols[0]  # Use first text column
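    # Note: astype(str) turns missing values into strings such as 'nan' or
    # 'None', so missing entries are counted as very short texts.
    df['text_length'] = df[text_col].astype(str).str.len()
    
    print(f"📊 Text Length Analysis for '{text_col}':")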
    print(f"   Shortest: {df['text_length'].min()} characters")
    print(f"   Longest: {df['text_length'].max()} characters")
    print(f"   Average: {df['text_length'].mean():.0f} characters")
    
    plt.figure(figsize=(10, 4))
    plt.hist(df['text_length'], bins=50, edgecolor='black', alpha=0.7)
    plt.xlabel('Text Length (characters)')
    plt.ylabel('Frequency')
    plt.title(f'Distribution of Text Length in {text_col}')
    plt.axvline(df['text_length'].mean(), color='red', linestyle='--', label=f'Mean: {df["text_length"].mean():.0f}')
    plt.legend()
    plt.show()

In [None]:
# ADD YOUR OWN INTERESTING FINDING HERE!
# What pattern, insight, or surprise did you discover?

# Example template:
print("🔍 MY INTERESTING FINDING:")
print("="*50)
print("""Describe what you found here...

- What did you discover?
- Why is it interesting or surprising?
- What business question could this help answer?
""")

# Your analysis code below:
# ...

---

## Submission Checklist

Before submitting, verify you have completed:

| Item | Points | Done? |
|------|--------|-------|
| Part 1: Environment setup works | 3 | ☐ |
| Part 2: Data loaded successfully | 4 | ☐ |
| Part 3: Data quality assessment (missing, duplicates, outliers) | 8 | ☐ |
| Part 4: Found something interesting with explanation | 5 | ☐ |
| **Total** | **20** | |

---

## How to Submit

1. **Run all cells** (Runtime → Run all)
2. **Save the notebook** (File → Save)
3. **Download as .ipynb** (File → Download → Download .ipynb)
4. **Upload to Canvas** under HW1 assignment

---

## Resources

- [HuggingFace Datasets Documentation](https://huggingface.co/docs/datasets/)
- [Pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
- [Data Quality Best Practices](https://www.ibm.com/topics/data-quality)