# What Are Variables?

Variables are the fundamental building blocks of data analysis. A variable is any characteristic, attribute, or property that can vary from one observation to another. Understanding variables and their types is crucial for selecting appropriate statistical methods, visualizations, and data encoding techniques in machine learning.

## 1. Definition of Variables

A **variable** is a measurable attribute or characteristic that:
- Can take on different values
- Can be observed, recorded, and measured
- Varies across different subjects or observations

**Examples:**
- Height of students (varies from person to person)
- Gender of individuals (male, female, other)
- Age of employees
- Customer satisfaction rating
- Weather temperature

Variables are essential because they represent the data we collect and analyze. Different types of variables require different statistical treatments.

## 2. Types of Variables

Variables can be broadly classified into two main categories:

### 2.1 Numerical (Quantitative) Variables

Variables that represent quantities and can be measured numerically.

#### **Continuous Variables**
- Can take any value within a range
- Infinite possible values (theoretically)
- Examples: height, weight, temperature, time, distance
- Measured on an interval or ratio scale

#### **Discrete Variables**
- Can only take specific, distinct values
- Countable number of values
- Often whole numbers (integers)
- Examples: number of students, count of defects, number of cars sold

### 2.2 Categorical (Qualitative) Variables

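Variables that represent categories or qualities rather than measured quantities. Numbers may be used as labels (for example, a 1-5 rating), but arithmetic on those labels is not meaningful.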

#### **Nominal Variables**
- Categories with no inherent order
- No meaningful ranking
- Examples: color (red, blue, green), gender (male, female), country, brand
- Can only be compared for equality

#### **Ordinal Variables**
- Categories with a natural order or ranking
- Order matters, but the distances between categories are unknown or not guaranteed to be equal
- Examples: education level (high school, bachelor, master, PhD), customer satisfaction (poor, fair, good, excellent), movie ratings (1-5 stars)
- Can be compared for equality and ordered

### 2.3 Visual Summary of Variable Types

```
VARIABLES
├── NUMERICAL (Quantitative)
│   ├── Continuous (e.g., height, weight, temperature)
│   └── Discrete (e.g., number of items, count)
└── CATEGORICAL (Qualitative)
    ├── Nominal (e.g., color, gender, country)
    └── Ordinal (e.g., ranking, rating, satisfaction level)
```
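The nominal/ordinal distinction can also be made explicit in pandas. The following is a minimal sketch, assuming hypothetical `colors` and `sizes` values that are not part of the datasets used in this notebook; it relies on `pd.Categorical` with `ordered=True` to mark a variable as ordinal.

```python
import pandas as pd

# Hypothetical values, chosen only to illustrate the two categorical types.
colors = pd.Series(pd.Categorical(["red", "blue", "green", "blue"]))  # nominal

sizes = pd.Series(
    pd.Categorical(
        ["small", "large", "medium", "small"],
        categories=["small", "medium", "large"],  # explicit ranking
        ordered=True,
    )
)  # ordinal

print("colors ordered?", colors.cat.ordered)
print("sizes ordered? ", sizes.cat.ordered)

# Order-based operations are only defined for the ordinal variable.
print("smallest size:", sizes.min())
print("at least medium:", (sizes >= "medium").tolist())
```

Equality checks work on both series, while ranking operations such as `min()` or `>=` are only meaningful for the ordered one.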

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Create a sample dataset to illustrate different variable types
data = {
    'Student_ID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Height_cm': [175.5, 168.3, 180.2, 172.1, 169.8, 182.4, 170.6, 165.9],  # Continuous
    'Number_of_Books': [5, 12, 3, 8, 15, 7, 10, 6],  # Discrete
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],  # Nominal
    'Education_Level': ['Bachelor', 'Master', 'Bachelor', 'PhD', 'Bachelor', 'Master', 'High School', 'Bachelor'],  # Ordinal
    'Satisfaction': ['Good', 'Excellent', 'Fair', 'Good', 'Excellent', 'Poor', 'Good', 'Fair']  # Ordinal
}

df = pd.DataFrame(data)
print("Sample Dataset:")
print(df)
print("\nData Types:")
print(df.dtypes)

## 3. Scales of Measurement

Scales of measurement describe the type of information encoded in a variable. There are four main levels:

### **Nominal Scale**
- **Purpose:** Categorization and identification
- **Operations:** Equality/inequality only
- **Examples:** gender, color, nationality
- **Appropriate Statistics:** Mode, frequency, chi-square

### **Ordinal Scale**
- **Purpose:** Ranking and ordering
- **Operations:** Equality, greater than, less than
- **Examples:** education level, movie ratings, satisfaction scale
- **Appropriate Statistics:** Mode, median, percentile

### **Interval Scale**
- **Purpose:** Measurement with equal intervals
- **Operations:** All ordinal operations + addition/subtraction
- **Examples:** temperature in Celsius, test scores, year
- **Appropriate Statistics:** Mean, standard deviation, correlation
- **Note:** No true zero point (zero is an arbitrary reference, so ratios such as "twice as hot" are not meaningful)

### **Ratio Scale**
- **Purpose:** Measurement with meaningful zero
- **Operations:** All interval operations + multiplication/division
- **Examples:** height, weight, age, income
- **Appropriate Statistics:** Mean, standard deviation, ratios, geometric mean
- **Note:** Has a true zero point and meaningful ratios

| Scale | Equality | Order | Equal Intervals | True Zero | Examples |
|-------|----------|-------|-----------------|-----------|----------|
| Nominal | Yes | No | No | No | Gender, Color |
| Ordinal | Yes | Yes | No | No | Rating, Ranking |
| Interval | Yes | Yes | Yes | No | Temperature (C), Year |
| Ratio | Yes | Yes | Yes | Yes | Height, Weight, Age |
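The difference between the interval and ratio rows can be checked with simple arithmetic. This is a small sketch, assuming two illustrative temperatures (10 °C and 20 °C): differences survive a change of unit origin, but ratios do not, which is why "twice as hot" is not a valid statement on the Celsius scale.

```python
def celsius_to_kelvin(c: float) -> float:
    """Convert Celsius (interval scale) to Kelvin (ratio scale)."""
    return c + 273.15

# Illustrative values only.
c_low, c_high = 10.0, 20.0
k_low, k_high = celsius_to_kelvin(c_low), celsius_to_kelvin(c_high)

# Differences are meaningful on both scales and agree with each other.
diff_c = c_high - c_low
diff_k = k_high - k_low

# Ratios depend on where zero sits, so they disagree.
ratio_c = c_high / c_low
ratio_k = k_high / k_low

print(f"Difference: {diff_c:.2f} (Celsius) vs {diff_k:.2f} (Kelvin)")
print(f"Ratio:      {ratio_c:.3f} (Celsius) vs {ratio_k:.3f} (Kelvin)")
```

Only the Kelvin ratio reflects a physical quantity, because Kelvin has a true zero.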

## 4. Dependent vs Independent Variables

### **Independent Variable (Predictor)**
- The variable that is **manipulated or controlled** in an experiment, or used as an input when working with observational data
- The **presumed cause** in a cause-and-effect relationship
- Also called **predictor** or **feature** in machine learning
- Conventionally plotted on the **X-axis**
- **Examples:**
  - Study hours (affects exam score)
  - Advertising budget (affects sales)
  - Temperature (affects ice cream sales)

### **Dependent Variable (Target)**
- The variable that is **measured or observed** as the outcome
- The **presumed effect** in a cause-and-effect relationship
- Also called **response** or **target** in machine learning
- Conventionally plotted on the **Y-axis**
- **Examples:**
  - Exam score (depends on study hours)
  - Sales (depends on advertising budget)
  - Ice cream sales (depends on temperature)

### **Key Relationship:**
```
Independent Variable → Influences → Dependent Variable
       (X)                              (Y)
     Predictor                        Response
```
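In machine learning code, this split usually appears as a feature array `X` and a target array `y`. The sketch below assumes hypothetical study-hour and exam-score values and fits a straight line with `numpy.polyfit`; the fit describes an association in the data and does not by itself establish that one variable causes the other.

```python
import numpy as np

# Hypothetical observations: hours studied (independent) and exam score (dependent).
X = np.array([2, 4, 6, 8, 10, 12, 14, 16], dtype=float)
y = np.array([45, 52, 68, 75, 82, 88, 91, 95], dtype=float)

# Least-squares line y ≈ slope * X + intercept.
slope, intercept = np.polyfit(X, y, deg=1)

# Use the fitted line to estimate the score for a new value of X.
hours = 9.0
predicted = slope * hours + intercept

print(f"slope = {slope:.2f} points per hour, intercept = {intercept:.2f}")
print(f"estimated score after {hours:.0f} hours: {predicted:.1f}")
```

The slope is read as the average change in the dependent variable per unit change in the independent variable.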

In [None]:
# Example: Relationship between study hours (independent) and exam score (dependent)
study_data = pd.DataFrame({
    'Study_Hours': [2, 4, 6, 8, 10, 12, 14, 16],  # Independent Variable (X)
    'Exam_Score': [45, 52, 68, 75, 82, 88, 91, 95]  # Dependent Variable (Y)
})

print("Study Hours vs Exam Score:")
print(study_data)

# Visualize the relationship
plt.figure(figsize=(10, 5))
plt.scatter(study_data['Study_Hours'], study_data['Exam_Score'], s=100, alpha=0.7)
plt.xlabel('Study Hours (Independent Variable)', fontsize=12)
plt.ylabel('Exam Score (Dependent Variable)', fontsize=12)
plt.title('Relationship Between Study Hours and Exam Score')
plt.grid(True, alpha=0.3)
plt.show()

# Calculate correlation
correlation = study_data['Study_Hours'].corr(study_data['Exam_Score'])
print(f"\nCorrelation: {correlation:.3f}")
print("A strong positive correlation: more study hours are associated with higher exam scores (correlation alone does not prove causation)")

## 5. Python Examples with Pandas: Identifying and Encoding Variables

In [None]:
# Create a comprehensive dataset
customer_data = pd.DataFrame({
    'Customer_ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Age': [25, 32, 45, 28, 55, 38, 42, 31, 48, 26],  # Continuous, Ratio
    'Income': [45000, 62000, 85000, 58000, 120000, 72000, 95000, 68000, 110000, 52000],  # Continuous, Ratio
    'Gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],  # Categorical, Nominal
    'Education': ['High School', 'Bachelor', 'Master', 'Bachelor', 'PhD', 'Master', 'Bachelor', 'High School', 'Master', 'Bachelor'],  # Categorical, Ordinal
    'Satisfaction': [3, 4, 5, 3, 5, 4, 4, 3, 5, 4],  # Discrete, Ordinal (1-5 rating)
    'Purchase_Count': [5, 12, 8, 15, 3, 10, 7, 14, 6, 11]  # Discrete, Ratio
})

print("Customer Dataset:")
print(customer_data)
print("\n" + "="*50)
print("Data Information:")
customer_data.info()  # info() prints its report directly and returns None

### 5.1 Identifying Variable Types

In [None]:
# Method 1: Inspect data types
print("Data Types:")
print(customer_data.dtypes)
print("\n" + "="*50)

# Method 2: Identify numerical vs categorical by dtype
# Note: Customer_ID has a numeric dtype but is an identifier (nominal), not a quantity
numerical_cols = customer_data.select_dtypes(include='number').columns.tolist()
categorical_cols = customer_data.select_dtypes(exclude='number').columns.tolist()

print(f"\nNumerical Columns: {numerical_cols}")
print(f"Categorical Columns: {categorical_cols}")
print("\n" + "="*50)

# Method 3: Check unique values to classify further
print("\nUnique Values Count (helps identify variable types):")
for col in customer_data.columns:
    unique_count = customer_data[col].nunique()
    print(f"{col}: {unique_count} unique values")

# Method 4: Statistical summary
print("\n" + "="*50)
print("\nStatistical Summary of Numerical Variables:")
print(customer_data[numerical_cols].describe())

### 5.2 Encoding Categorical Variables

Most machine learning models expect numerical input, so categorical variables usually need to be encoded as numbers before training.

**Common Encoding Methods:**
1. **Label Encoding:** Convert categories to integers (0, 1, 2, ...); the integers carry an implied order, so this suits binary variables and target labels best
2. **One-Hot Encoding:** Create binary columns for each category
3. **Ordinal Encoding:** Map categories to ordered integers based on ranking
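The practical difference between label encoding and ordinal encoding shows up as soon as the alphabetical order of the categories differs from their ranking. The following is a plain-Python sketch, assuming a short hypothetical list of education levels; the alphabetical mapping mimics how label encoders that sort their classes (such as scikit-learn's `LabelEncoder`) assign codes.

```python
# Hypothetical sample of an ordinal variable.
levels = ["High School", "Bachelor", "Master", "Bachelor", "PhD"]

# Label encoding: codes follow the sorted (alphabetical) order of the categories.
alphabetical = {cat: code for code, cat in enumerate(sorted(set(levels)))}

# Ordinal encoding: codes follow an explicitly supplied ranking.
education_order = ["High School", "Bachelor", "Master", "PhD"]
ranked = {cat: code for code, cat in enumerate(education_order)}

label_codes = [alphabetical[x] for x in levels]
ordinal_codes = [ranked[x] for x in levels]

print("Label codes:  ", label_codes)    # Bachelor gets a lower code than High School
print("Ordinal codes:", ordinal_codes)  # High School < Bachelor < Master < PhD
```

With alphabetical codes, "Bachelor" sorts before "High School", so a model that reads the codes as magnitudes would see the ranking inverted for those two levels.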

In [None]:
from sklearn.preprocessing import LabelEncoder

# Create a copy for encoding examples
df_encoded = customer_data.copy()

# Method 1: Label Encoding (shown here on a binary nominal variable)
# Best for: target labels or binary categories; with 3+ nominal categories the integer codes imply an order that does not exist
print("METHOD 1: Label Encoding for Gender (Nominal Variable)")
print("Before:")
print(df_encoded[['Customer_ID', 'Gender']].head())

label_encoder = LabelEncoder()
df_encoded['Gender_Encoded'] = label_encoder.fit_transform(df_encoded['Gender'])
print("\nAfter Label Encoding:")
print(df_encoded[['Customer_ID', 'Gender', 'Gender_Encoded']].head())
print(f"\nMapping: {dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))}")

In [None]:
# Method 2: One-Hot Encoding (for Nominal Variables)
# Best for: Nominal categories with a small number of distinct values (one 0/1 column is created per category)
print("METHOD 2: One-Hot Encoding for Gender (Nominal Variable)")
gender_onehot = pd.get_dummies(df_encoded['Gender'], prefix='Gender')
df_with_onehot = pd.concat([df_encoded[['Customer_ID', 'Gender']], gender_onehot], axis=1)
print(df_with_onehot.head())
print("\nAdvantage: No ordinal relationship is implied between categories")

In [None]:
# Method 3: Ordinal Encoding (for Ordinal Variables)
# Best for: Categorical variables with natural ordering
print("METHOD 3: Ordinal Encoding for Education Level (Ordinal Variable)")
print("Before:")
print(df_encoded[['Customer_ID', 'Education']].head())

# Define the order for education levels
education_order = ['High School', 'Bachelor', 'Master', 'PhD']
education_mapping = {edu: i for i, edu in enumerate(education_order)}

df_encoded['Education_Encoded'] = df_encoded['Education'].map(education_mapping)
print("\nAfter Ordinal Encoding:")
print(df_encoded[['Customer_ID', 'Education', 'Education_Encoded']].head())
print(f"\nMapping: {education_mapping}")
print("\nImportant: Order is preserved (High School < Bachelor < Master < PhD)")

In [None]:
# Summary of all encodings
print("\nSUMMARY: Encoded Dataset")
print(df_encoded[['Customer_ID', 'Gender', 'Gender_Encoded', 'Education', 'Education_Encoded']].head(10))

## 6. Appropriate Statistics and Visualizations for Each Variable Type

### 6.1 Numerical Variables (Continuous and Discrete)

#### **Appropriate Statistics:**
- **Central Tendency:** Mean, Median, Mode
- **Dispersion:** Range, Variance, Standard Deviation, IQR
- **Position:** Percentiles, Quartiles
- **Relationship:** Correlation, Covariance

#### **Appropriate Visualizations:**
- Histogram (distribution)
- Box plot (outliers, quartiles)
- Scatter plot (relationship with other variables)
- Density plot (probability distribution)
- Line plot (trend over time)
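The quartiles and outlier markers in a box plot come from the IQR listed under dispersion. Here is a minimal sketch, assuming the common 1.5 × IQR rule for outlier fences and reusing the ages from the customer dataset above:

```python
import numpy as np

# Ages copied from the customer dataset.
ages = np.array([25, 32, 45, 28, 55, 38, 42, 31, 48, 26])

# Quartiles and interquartile range (linear interpolation, NumPy's default).
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1

# Fences under the 1.5 * IQR convention; values outside are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Outliers: {outliers.tolist()}")
```

Other quantile interpolation methods give slightly different quartiles on small samples, so the fences are a convention rather than a fixed property of the data.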

In [None]:
# Statistics for Numerical Variables
print("NUMERICAL VARIABLES - STATISTICS")
print("\nAge Statistics:")
print(f"Mean: {customer_data['Age'].mean():.2f}")
print(f"Median: {customer_data['Age'].median():.2f}")
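# Note: every age in this sample is unique, so mode() returns all values and [0] is simply the smallest
print(f"Mode: {customer_data['Age'].mode().values[0]:.2f}")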
print(f"Std Dev: {customer_data['Age'].std():.2f}")
print(f"Range: {customer_data['Age'].max() - customer_data['Age'].min()}")
print(f"Q1 (25%): {customer_data['Age'].quantile(0.25):.2f}")
print(f"Q3 (75%): {customer_data['Age'].quantile(0.75):.2f}")

# Correlation for numerical variables
print("\n" + "="*50)
print("\nCorrelation between Numerical Variables:")
# Note: Satisfaction is ordinal; Pearson correlation treats its 1-5 codes as equally spaced
numerical_cols = ['Age', 'Income', 'Satisfaction', 'Purchase_Count']
print(customer_data[numerical_cols].corr())

In [None]:
# Visualizations for Numerical Variables
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Histogram
axes[0, 0].hist(customer_data['Age'], bins=8, color='skyblue', edgecolor='black')
axes[0, 0].set_title('Histogram: Age Distribution (Continuous)')
axes[0, 0].set_xlabel('Age')
axes[0, 0].set_ylabel('Frequency')

# Box Plot
axes[0, 1].boxplot([customer_data['Age'], customer_data['Income']/1000])
axes[0, 1].set_title('Box Plot: Age and Income Distribution')
axes[0, 1].set_xticklabels(['Age', 'Income (in thousands)'])
axes[0, 1].set_ylabel('Values')

# Scatter Plot
axes[1, 0].scatter(customer_data['Age'], customer_data['Income'], s=100, alpha=0.7, color='green')
axes[1, 0].set_title('Scatter Plot: Age vs Income')
axes[1, 0].set_xlabel('Age')
axes[1, 0].set_ylabel('Income')
axes[1, 0].grid(True, alpha=0.3)

# Density Plot
customer_data['Income'].plot(kind='density', ax=axes[1, 1], color='orange')
axes[1, 1].set_title('Density Plot: Income Distribution')
axes[1, 1].set_xlabel('Income')

plt.tight_layout()
plt.show()

### 6.2 Categorical Variables (Nominal)

#### **Appropriate Statistics:**
- **Frequency:** Count, Percentage
- **Central Tendency:** Mode
- **Association:** Chi-square test

#### **Appropriate Visualizations:**
- Bar chart (frequency of categories)
- Pie chart (proportion of categories)
- Contingency table (relationship between two categorical variables)
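A contingency table and a chi-square test of independence can be sketched as follows, assuming the `Gender` and `Education` values from the customer dataset and SciPy's `chi2_contingency`. With only ten rows the expected counts are far too small for the p-value to be trusted, so this is an illustration of the mechanics only.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Values copied from the customer dataset.
gender = pd.Series(["M", "F", "M", "F", "M", "F", "M", "F", "M", "F"], name="Gender")
education = pd.Series(
    ["High School", "Bachelor", "Master", "Bachelor", "PhD",
     "Master", "Bachelor", "High School", "Master", "Bachelor"],
    name="Education",
)

# Contingency table: counts for every Gender x Education combination.
table = pd.crosstab(gender, education)
print(table)

# Chi-square test of independence on the observed counts.
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"\nchi2 = {chi2:.3f}, dof = {dof}, p-value = {p_value:.3f}")
```

The degrees of freedom equal (rows − 1) × (columns − 1), which makes a quick sanity check on the table's shape.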

In [None]:
# Statistics for Categorical Variables (Nominal)
print("CATEGORICAL VARIABLES (NOMINAL) - STATISTICS")
print("\nGender Frequency Distribution:")
gender_counts = customer_data['Gender'].value_counts()
print(gender_counts)

print("\nGender Percentage Distribution:")
gender_percentage = customer_data['Gender'].value_counts(normalize=True) * 100
print(gender_percentage.round(2))

print("\nMode: ", customer_data['Gender'].mode()[0])
print("\n" + "="*50)

In [None]:
# Visualizations for Categorical Variables (Nominal)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar Chart
gender_counts = customer_data['Gender'].value_counts()
axes[0].bar(gender_counts.index, gender_counts.values, color=['steelblue', 'coral'], edgecolor='black')
axes[0].set_title('Bar Chart: Gender Distribution (Nominal)')
axes[0].set_xlabel('Gender')
axes[0].set_ylabel('Frequency')
axes[0].grid(True, alpha=0.3, axis='y')

# Pie Chart
axes[1].pie(gender_counts.values, labels=gender_counts.index, autopct='%1.1f%%', colors=['steelblue', 'coral'])
axes[1].set_title('Pie Chart: Gender Proportion (Nominal)')

plt.tight_layout()
plt.show()

### 6.3 Categorical Variables (Ordinal)

#### **Appropriate Statistics:**
- **Frequency:** Count, Percentage
- **Central Tendency:** Mode, Median
- **Position:** Percentiles
- **Association:** Spearman's rank correlation

#### **Appropriate Visualizations:**
- Ordered bar chart (respecting category order)
- Stacked bar chart (multiple ordinal variables)
- Heatmap (relationship between ordinal and other variables)
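The median of an ordinal variable stored as text can be found by ranking the categories first. This is a small sketch, assuming the education ranking used elsewhere in this notebook and the standard library's `statistics.median_low`, which returns an actual data value instead of averaging the two middle ranks.

```python
from statistics import median_low

# Ranking and values copied from the customer dataset.
education_order = ["High School", "Bachelor", "Master", "PhD"]
education = ["High School", "Bachelor", "Master", "Bachelor", "PhD",
             "Master", "Bachelor", "High School", "Master", "Bachelor"]

# Map each category to its rank, take the median rank, and map it back.
rank = {level: i for i, level in enumerate(education_order)}
median_rank = median_low([rank[level] for level in education])
median_level = education_order[median_rank]

print("Median education level:", median_level)
```

Averaging two different middle ranks would produce a value such as 1.5 that corresponds to no category, which is why a low (or high) median is the safer choice for ordinal data.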

In [None]:
# Statistics for Categorical Variables (Ordinal)
print("CATEGORICAL VARIABLES (ORDINAL) - STATISTICS")
print("\nEducation Level Frequency Distribution:")
education_counts = customer_data['Education'].value_counts()
print(education_counts)

print("\nEducation Level (Ordered):")
education_order = ['High School', 'Bachelor', 'Master', 'PhD']
education_ordered = customer_data['Education'].value_counts().reindex(education_order)
print(education_ordered)

print("\nSatisfaction Rating Distribution:")
print(customer_data['Satisfaction'].value_counts().sort_index())
print(f"\nMedian Satisfaction: {customer_data['Satisfaction'].median():.2f}")

# Spearman correlation for ordinal variables
from scipy.stats import spearmanr
correlation, p_value = spearmanr(customer_data['Satisfaction'], customer_data['Income'])
print(f"\nSpearman Correlation (Satisfaction vs Income): {correlation:.3f} (p-value: {p_value:.4f})")

In [None]:
# Visualizations for Categorical Variables (Ordinal)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Ordered Bar Chart for Education
education_order = ['High School', 'Bachelor', 'Master', 'PhD']
education_ordered = customer_data['Education'].value_counts().reindex(education_order)
colors = ['#ff9999', '#ffcc99', '#99ccff', '#99ff99']
axes[0].bar(range(len(education_ordered)), education_ordered.values, color=colors, edgecolor='black')
axes[0].set_title('Ordered Bar Chart: Education Level (Ordinal)')
axes[0].set_xlabel('Education Level')
axes[0].set_ylabel('Frequency')
axes[0].set_xticks(range(len(education_ordered)))
axes[0].set_xticklabels(education_ordered.index, rotation=45)
axes[0].grid(True, alpha=0.3, axis='y')

# Satisfaction Rating Distribution
satisfaction_counts = customer_data['Satisfaction'].value_counts().sort_index()
axes[1].bar(satisfaction_counts.index, satisfaction_counts.values, color='mediumpurple', edgecolor='black')
axes[1].set_title('Bar Chart: Customer Satisfaction (Ordinal)')
axes[1].set_xlabel('Satisfaction Rating')
axes[1].set_ylabel('Frequency')
axes[1].set_xticks(sorted(customer_data['Satisfaction'].unique()))
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

### 6.4 Quick Reference Table

| Variable Type | Statistics | Visualizations | Encoding Method |
|---------------|-----------|-----------------|------------------|
| **Continuous** | Mean, Median, Std Dev, Correlation | Histogram, Box plot, Density plot, Scatter plot | Use as-is (no encoding needed) |
| **Discrete** | Mean, Median, Mode, Count | Histogram, Bar chart, Box plot | Use as-is or treat as categorical |
| **Nominal** | Mode, Frequency, Percentage, Chi-square | Bar chart, Pie chart, Contingency table | One-Hot Encoding (Label Encoding for binary variables) |
| **Ordinal** | Mode, Median, Percentile, Spearman correlation | Ordered Bar chart, Stacked bar chart, Heatmap | Ordinal Encoding with an explicit category order |

In [None]:
# Comprehensive Example: Analyzing All Variable Types Together
print("COMPREHENSIVE VARIABLE ANALYSIS")
print("="*60)

# Create analysis summary
analysis_summary = {
    'Variable': ['Age', 'Income', 'Gender', 'Education', 'Satisfaction', 'Purchase_Count'],
    'Type': ['Continuous', 'Continuous', 'Nominal', 'Ordinal', 'Ordinal', 'Discrete'],
    'Scale': ['Ratio', 'Ratio', 'Nominal', 'Ordinal', 'Ordinal', 'Ratio'],
    'Example Statistic': [
        f"{customer_data['Age'].mean():.1f} (Mean)",
        f"{customer_data['Income'].mean():.0f} (Mean)",
        f"{customer_data['Gender'].mode()[0]} (Mode)",
        f"{customer_data['Education'].mode()[0]} (Mode)",
        f"{customer_data['Satisfaction'].median():.1f} (Median)",
        f"{customer_data['Purchase_Count'].mean():.1f} (Mean)"
    ],
    'Best Visualization': ['Histogram', 'Box plot', 'Pie chart', 'Ordered bar', 'Bar chart', 'Histogram']
}

summary_df = pd.DataFrame(analysis_summary)
print(summary_df.to_string(index=False))

## Summary

### Key Takeaways

1. **Variables are the foundation of data analysis** - They represent the characteristics we measure and analyze.

2. **Two main variable categories:**
   - **Numerical:** Continuous (height, temperature) and Discrete (count, number)
   - **Categorical:** Nominal (color, gender) and Ordinal (rating, ranking)

3. **Four scales of measurement:**
   - **Nominal:** Categories only (no order)
   - **Ordinal:** Categories with meaningful order
   - **Interval:** Ordered with equal intervals (no true zero)
   - **Ratio:** Ordered with equal intervals and true zero

4. **Independent vs Dependent variables:**
   - **Independent:** The predictor or presumed cause (X-axis)
   - **Dependent:** The response or presumed effect (Y-axis)

5. **Encoding categorical variables:**
   - **One-Hot Encoding:** For nominal variables
   - **Ordinal Encoding:** For ordinal variables preserving order
   - **Label Encoding:** For binary variables or target labels

6. **Choose appropriate statistics and visualizations:**
   - **Numerical:** Mean, std dev, histogram, scatter plot
   - **Nominal:** Mode, frequency, bar chart, pie chart
   - **Ordinal:** Median, percentile, ordered bar chart

### Further Learning

- Practice identifying variable types in real datasets
- Experiment with different encoding methods and compare their impact on model performance
- Learn about feature engineering to create meaningful new variables from existing ones
- Study the assumptions of different statistical tests based on variable types