# Data Exploration with Pandas

This notebook focuses on **exploring datasets** in Pandas to understand their structure, content, and quality. You'll learn how to use key methods to quickly assess your data, identify patterns, and spot potential issues like missing values or incorrect data types.

## Core Concepts

- **Data Exploration**: The process of inspecting a dataset to understand its structure, content, and quality before analysis or modeling.
- **Key Goals**:
  - Assess dataset size, structure, and data types.
  - Identify missing values, duplicates, or anomalies.
  - Understand the distribution and statistical properties of the data.
  - Familiarize yourself with the dataset's columns and values.
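
The first two goals map onto a handful of one-line checks. The sketch below is a minimal illustration on a hypothetical toy DataFrame (`toy`, with made-up `city` and `temp_c` columns that are not part of this notebook's dataset). It uses `.shape`, `.dtypes`, and `.duplicated()`, which are not covered elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical toy data: the third row is an exact copy of the second.
toy = pd.DataFrame({
    "city": ["Oslo", "Lima", "Lima", "Pune"],
    "temp_c": [4.0, 19.5, 19.5, 31.0],
})

n_rows, n_cols = toy.shape                   # dataset size as (rows, columns)
dtypes = toy.dtypes                          # data type of each column
n_duplicates = int(toy.duplicated().sum())   # rows identical to an earlier row

print(f"{n_rows} rows x {n_cols} columns")
print(dtypes)
print(f"Duplicate rows: {n_duplicates}")
```

`.duplicated()` marks only the second and later occurrences of a row, so a single repeated row counts once.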

## Key Methods & Functions

Below are the essential methods for exploring data in Pandas:

- **`.head(n)`**: Displays the first `n` rows (default `n=5`).
- **`.tail(n)`**: Displays the last `n` rows (default `n=5`).
- **`.info()`**: Shows DataFrame structure: column names, non-null counts (from which missing values can be inferred), data types, and memory usage.
- **`.describe()`**: Provides a statistical summary (count, mean, std, min, max, quartiles) for numerical columns.
- **`.value_counts()`**: Counts how often each unique value occurs in a Series, sorted from most to least frequent (useful for categorical data). Missing values are excluded by default.
- **`.unique()`**: Returns an array of unique values in a Series.
- **`.nunique()`**: Counts the number of distinct values in a Series, or per column in a DataFrame. Missing values are excluded by default.
- **`.sample(n)`**: Returns a random sample of `n` rows (default `n=1`).
- **`.columns.tolist()`**: Returns column names as a list.

## Learning Objectives

- Learn how to quickly assess data quality and structure.
- Understand statistical summaries and their implications.
- Identify data types, missing values, and potential inconsistencies.
- Get familiar with the distribution of categorical and numerical data.

### 1. Setting Up a Sample Dataset

In [1]:
import pandas as pd
import numpy as np

# Creating a sample DataFrame
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', None],
    'age': [25, 30, 35, 28, np.nan, 40],
    'salary': [50000, 60000, 75000, 52000, 48000, 80000],
    'department': ['HR', 'IT', 'IT', 'Marketing', 'HR', 'IT'],
    'years_experience': [2, 5, 10, 3, 1, 15]
}
df = pd.DataFrame(data)

### 2. Basic Data Inspection

In [2]:
# Viewing the first 3 rows
print("First 3 rows:")
print(df.head(3))

# Viewing the last 2 rows
print("\nLast 2 rows:")
print(df.tail(2))

# Getting a random sample of 2 rows (the rows returned change from run to run)
print("\nRandom sample of 2 rows:")
print(df.sample(2))

First 3 rows:
      name   age  salary department  years_experience
0    Alice  25.0   50000         HR                 2
1      Bob  30.0   60000         IT                 5
2  Charlie  35.0   75000         IT                10

Last 2 rows:
   name   age  salary department  years_experience
4   Eve   NaN   48000         HR                 1
5  None  40.0   80000         IT                15

Random sample of 2 rows:
    name   age  salary department  years_experience
3  David  28.0   52000  Marketing                 3
4    Eve   NaN   48000         HR                 1
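
Because `.sample()` draws rows at random, the output above will usually differ between runs. A minimal sketch of making the draw repeatable with the `random_state` parameter, using a hypothetical DataFrame `df_demo` built only for this illustration:

```python
import pandas as pd

# Hypothetical demo data: 10 rows with an id and its square.
df_demo = pd.DataFrame({"id": range(10), "value": [x * x for x in range(10)]})

# The same seed gives the same rows on every call.
first = df_demo.sample(n=3, random_state=42)
second = df_demo.sample(n=3, random_state=42)
print(first)
print("Identical draws:", first.equals(second))

# frac samples a share of the rows instead of a fixed count (here 20%).
frac_sample = df_demo.sample(frac=0.2, random_state=0)
print(len(frac_sample), "rows sampled with frac=0.2")
```

The seed values `42` and `0` are arbitrary; any fixed integer works.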


### 3. Understanding Data Structure

**Explanation**: `.info()` shows the non-null count for each column. Any count below the total number of entries signals missing values (here, `name` and `age` each have 5 non-null values out of 6 entries). It also displays data types and memory usage.

In [3]:
# Getting DataFrame structure and memory usage
# (.info() prints its report directly and returns None, so it is not wrapped in print())
print("DataFrame info:")
df.info()

# Listing column names
print("\nColumn names:")
print(df.columns.tolist())

DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              5 non-null      object 
 1   age               5 non-null      float64
 2   salary            6 non-null      int64  
 3   department        6 non-null      object 
 4   years_experience  6 non-null      int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 372.0+ bytes

Column names:
['name', 'age', 'salary', 'department', 'years_experience']
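
The non-null counts from `.info()` can also be turned into explicit missing-value counts. The sketch below assumes the sample dataset from section 1, rebuilt here so the snippet runs on its own, and uses `.isna()`, which this notebook does not otherwise cover.

```python
import numpy as np
import pandas as pd

# Same sample data as in section 1, rebuilt so this snippet is self-contained.
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', None],
    'age': [25, 30, 35, 28, np.nan, 40],
    'salary': [50000, 60000, 75000, 52000, 48000, 80000],
    'department': ['HR', 'IT', 'IT', 'Marketing', 'HR', 'IT'],
    'years_experience': [2, 5, 10, 3, 1, 15]
})

missing_counts = df.isna().sum()               # missing values per column
missing_percent = df.isna().mean() * 100       # share of missing values per column, in percent
rows_with_missing = df[df.isna().any(axis=1)]  # rows containing at least one missing value

print(missing_counts)
print(missing_percent.round(1))
print(rows_with_missing)
```

For this dataset, `name` and `age` each report one missing value, and the affected rows are those with index 4 and 5.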


### 4. Statistical Summary

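**Explanation**: `.describe()` provides count, mean, standard deviation, min, max, and quartiles for numerical columns, helping you understand the data's distribution. The `count` row excludes missing values, which is why `age` shows 5 while the other columns show 6.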

In [4]:
# Getting statistical summary of numerical columns
print("Statistical summary:")
print(df.describe())

Statistical summary:
            age        salary  years_experience
count   5.00000      6.000000          6.000000
mean   31.60000  60833.333333          6.000000
std     5.94138  13629.624597          5.440588
min    25.00000  48000.000000          1.000000
25%    28.00000  50500.000000          2.250000
50%    30.00000  56000.000000          4.000000
75%    35.00000  71250.000000          8.750000
max    40.00000  80000.000000         15.000000
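
By default, `.describe()` on a DataFrame skips non-numeric columns such as `department`. As a hedged sketch, calling `.describe()` on a single text column returns a different summary (count, number of unique values, most frequent value, and its frequency). The snippet also compares the mean and median of `salary` as a rough skew check. It assumes the sample data from section 1, reduced to the two columns it needs.

```python
import pandas as pd

# Two columns from the section 1 sample data, rebuilt so the snippet runs on its own.
df = pd.DataFrame({
    'salary': [50000, 60000, 75000, 52000, 48000, 80000],
    'department': ['HR', 'IT', 'IT', 'Marketing', 'HR', 'IT'],
})

# Summary of a text column: count, unique, top (most frequent value), freq.
dept_summary = df['department'].describe()
print(dept_summary)

# A mean well above the median hints at a right-skewed distribution.
salary_mean = df['salary'].mean()
salary_median = df['salary'].median()
print(f"mean={salary_mean:.2f}, median={salary_median:.2f}")
```

With only six rows this comparison is illustrative rather than conclusive.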


### 5. Exploring Categorical Data

**Explanation**: `.value_counts()` is great for categorical data to see frequency distributions. `.unique()` and `.nunique()` help understand the variety of categories.

In [5]:
# Counting how often each value occurs in 'department'
print("Department value counts:")
print(df['department'].value_counts())

# Getting unique values in 'department'
print("\nUnique departments:")
print(df['department'].unique())

# Counting the number of distinct values in 'department'
print("\nNumber of unique departments:")
print(df['department'].nunique())

Department value counts:
department
IT           3
HR           2
Marketing    1
Name: count, dtype: int64

Unique departments:
['HR' 'IT' 'Marketing']

Number of unique departments:
3
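
Two optional parameters of `.value_counts()` are often useful during exploration: `normalize=True` returns proportions instead of counts, and `dropna=False` includes missing values in the tally. The sketch below assumes the `name` and `department` columns from section 1, rebuilt so the snippet is self-contained, and also calls `.nunique()` on the whole DataFrame.

```python
import pandas as pd

# Two columns from the section 1 sample data.
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', None],
    'department': ['HR', 'IT', 'IT', 'Marketing', 'HR', 'IT'],
})

# Proportion of rows per department instead of raw counts.
shares = df['department'].value_counts(normalize=True)
print(shares)

# Missing values are dropped by default; dropna=False keeps them as their own entry.
name_counts_default = df['name'].value_counts()
name_counts_all = df['name'].value_counts(dropna=False)
print(name_counts_all)

# On a DataFrame, .nunique() returns the number of distinct values per column.
distinct_per_column = df.nunique()
print(distinct_per_column)
```

Comparing the two `name` tallies shows where missing values would otherwise go unnoticed: the default tally sums to 5, while the `dropna=False` tally sums to 6.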


## Key Takeaways

- **Quick Inspection**: Use `.head()`, `.tail()`, and `.sample()` to get a snapshot of your data.
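- **Structure and Quality**: `.info()` reveals data types, non-null counts (and therefore missing values), and memory usage.
- **Statistical Insights**: `.describe()` summarizes numerical data. Comparing the mean with the median (the 50% row) and the min/max with the quartiles helps flag skewed distributions and potential outliers.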
- **Categorical Analysis**: `.value_counts()`, `.unique()`, and `.nunique()` are essential for understanding categorical columns.
- **Data Quality**: Look for missing values (e.g., `name` and `age` in our example) and incorrect data types to plan cleaning steps.