# **📘Day 1 - Pandas Series & DataFrame 🐼**

#### **Goal:** Grasp the two fundamental building blocks of Pandas: Series (1D) and DataFrames (2D).

#### **Topics To Cover:** Pandas Introduction, Series creation, indexing, key attributes, DataFrame creation, and basic operations.

----

## **Introduction to Pandas 🐼**
Pandas (Python Data Analysis Library) is the Swiss Army knife for data manipulation and analysis in Python. It provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It is an essential tool for almost every data science workflow.

**Why is Pandas Important? 💡**

* **Structured Data:** It’s built to handle tabular data (like spreadsheets or SQL tables) effortlessly.

* **Clean and Manipulate Data:** It simplifies the messy process of data cleaning, transformation, and preparation for analysis or machine learning models.

* **Efficiency:** Under the hood, Pandas leverages the speed of NumPy, allowing for quick computations on large datasets.

----

## **Component 1: The Pandas Series 🥇**
**Definition:**
A Pandas Series is the most basic building block. It is a one-dimensional array-like object that can hold data of any single type (e.g., integers, strings, floating-point numbers, Python objects).

**Key Feature:**
What distinguishes a Series from a plain NumPy array is its labeled index. Each element is paired with a label (labels are usually unique, although Pandas does not require it), which enables label-based access and automatic alignment of data between Series.

**Think of a Series as:** A single, named column from a spreadsheet or a SQL table.
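
To make the alignment idea concrete, here is a minimal sketch. The Series names and numbers are made up for illustration; the point is only that arithmetic matches values by label rather than by position.

```python
import pandas as pd

# Two hypothetical Series with partially overlapping labels
q1_sales = pd.Series([10, 20, 30], index=["apples", "bananas", "cherries"])
q2_sales = pd.Series([5, 15, 25], index=["bananas", "cherries", "dates"])

# Addition pairs values that share a label; labels present in only
# one Series produce NaN in the result
total = q1_sales + q2_sales
print(total)
```

Labels found in only one of the two Series ("apples" and "dates" here) come out as `NaN`, which is why the result is a float Series.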

### **Key Characteristics of a Series** ✨
A Series is characterized by three main components:

* **Data (Values)**: The actual content stored, which can be created from a list, dictionary, or a NumPy array.

* **Index (Labels)**: The labels for each row. If you don't provide one, Pandas defaults to a simple integer index starting at 0.

* **Data Type (dtype)**: The type of data stored in the series (e.g., int64 for integers, object for strings).
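
The three components above can be inspected directly. The following is a small sketch with made-up values, showing a Series built from a list, a dictionary, and a NumPy array; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# From a list: Pandas supplies the default integer index 0, 1, 2
from_list = pd.Series([10, 20, 30])

# From a dictionary: the keys become the index labels
from_dict = pd.Series({"a": 1.5, "b": 2.5})

# From a NumPy array, with an explicit index and a name
from_array = pd.Series(np.array([7, 8]), index=["first", "second"], name="scores")

print(from_list.index)     # the default integer index
print(from_dict.index)     # labels taken from the dictionary keys
print(from_dict.dtype)     # data type of the stored values
print(from_array.to_numpy())  # the underlying values as a NumPy array
```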

In [1]:
# import the necessary libraries
import pandas as pd
import numpy as np

# create a sample Series
series = pd.Series([1, 2, 3, 4, 5])
print(series)

0    1
1    2
2    3
3    4
4    5
dtype: int64


In [2]:
# Assign custom labels to the index, then display the Series
series.index = ['a', 'b', 'c', 'd', 'e']
print(series)

a    1
b    2
c    3
d    4
e    5
dtype: int64


### Accessing Elements

In [3]:
# Positional access with plain brackets, e.g. series[2], is deprecated on a
# label-indexed Series; use series.iloc[2] for position-based access instead.

# Access by index label
print(series['c'])  # Output: 3

3
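
As a sketch of the two explicit accessors, the example below rebuilds the same Series and reads one element by label with `.loc` and by position with `.iloc`. It assumes nothing beyond the Series defined earlier in this notebook.

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

by_label = series.loc["c"]      # label-based lookup
by_position = series.iloc[2]    # position-based lookup (third element)
last = series.iloc[-1]          # negative positions count from the end

print(by_label, by_position, last)
```

Using `.loc` and `.iloc` makes the intent explicit, so the code does not depend on how plain brackets interpret an integer key.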


### Slicing

In [4]:
# Slicing by position: prefer series.iloc[1:4] over series[1:4] so the intent
# is explicit. Positional slices exclude the end position.

# Slicing by index label: both endpoints are included
print(series['b':'d'])  # Output: rows b, c and d

b    2
c    3
d    4
dtype: int64
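
The difference between the two slicing styles is easy to miss, so here is a short sketch on the same Series: a label slice includes its end label, while a positional slice excludes its end position.

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

label_slice = series.loc["b":"d"]   # includes the end label 'd'
position_slice = series.iloc[1:4]   # excludes position 4, so it stops at 'd'
shorter_slice = series.iloc[1:3]    # positions 1 and 2 only: 'b' and 'c'

print(label_slice)
print(shorter_slice)
```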


### Filtering

In [5]:
# Filtering with boolean masking
filtered_series = series[series > 2]
print(filtered_series)

c    3
d    4
e    5
dtype: int64
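
Boolean masks can also be combined. The sketch below, again on the same Series, uses `&` (and), `|` (or) and `~` (not); each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# Values strictly between 1 and 5
between = series[(series > 1) & (series < 5)]

# Values equal to either end of the range
extremes = series[(series == 1) | (series == 5)]

# Everything except the value 3
not_three = series[~(series == 3)]

print(between)
print(extremes)
print(not_three)
```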


----

## **Component 2: The Pandas DataFrame 🏗️**
**Definition:**
A Pandas DataFrame is the most commonly used object in Pandas. It is a two-dimensional labeled data structure with columns that can individually be of different types. It can be thought of as a dictionary of Series objects that share a common index.

**The Role of the DataFrame 🎯**
The DataFrame is the fundamental structure for data analysis. It perfectly mirrors real-world datasets like SQL tables or Excel sheets, providing a structured, table-like view of your data. It is the go-to structure for tasks like filtering, grouping, merging, and statistical modeling.

**Think of a DataFrame as:** The entire spreadsheet or SQL table, composed of rows (indexed) and named columns.
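
The "dictionary of Series" view can be sketched directly. The student labels and values below are made up for illustration; each dictionary key becomes a column name and the shared index becomes the row index.

```python
import pandas as pd

# Two hypothetical Series that share the same index labels
ages = pd.Series([19, 22, 20], index=["s1", "s2", "s3"])
countries = pd.Series(["USA", "India", "UK"], index=["s1", "s2", "s3"])

# Each key becomes a column; the shared labels become the row index
students = pd.DataFrame({"Age": ages, "Country": countries})
print(students)

# Pulling a column back out returns a Series with the same index
age_column = students["Age"]
print(type(age_column))
```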


### **Key Characteristics of a DataFrame 🔑**

A DataFrame is essentially a two-dimensional structure defined by four major characteristics:

* **Structure (Rows & Columns):** It has both a row index (like a Series) and a column index. The combination of these two indexes allows for precise, labeled data access.
* **Column Homogeneity:** While the entire DataFrame can have mixed types (e.g., one column of integers, one of strings), each individual **column** (which is a Series) must hold values of the **same data type** (`dtype`).
* **Size Mutability:** You can easily add or delete columns from a DataFrame.
* **Row Mutability:** You can easily add or delete rows from a DataFrame.
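
The two mutability points can be illustrated with a small sketch. The toy DataFrame and its column names are assumptions made for this example; `drop` and `pd.concat` are used here as one common way to remove and add data, not the only one.

```python
import pandas as pd

toy = pd.DataFrame({"Age": [19, 22, 20], "Sleep_Hours": [6.5, 7.5, 5.0]})

# Add a column derived from an existing one (modifies toy in place)
toy["Is_Adult"] = toy["Age"] >= 18

# Delete a column: drop returns a new DataFrame and leaves toy unchanged
without_sleep = toy.drop(columns=["Sleep_Hours"])

# Add a row by concatenating a one-row DataFrame
new_row = pd.DataFrame({"Age": [18], "Sleep_Hours": [7.0], "Is_Adult": [True]})
extended = pd.concat([toy, new_row], ignore_index=True)

# Delete a row by its index label
trimmed = extended.drop(index=0)

print(without_sleep.columns.tolist())
print(extended.shape)
print(trimmed.index.tolist())
```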

| Feature | Description |
| :--- | :--- |
| **Index** | Labels for the **rows** (the `df.index` attribute). |
| **Columns** | Labels for the **columns** (the `df.columns` attribute). |
| **Values** | The actual data stored, accessible as a NumPy array (the `df.values` attribute, or the `df.to_numpy()` method recommended by the Pandas documentation). |
| **Dtypes** | The data type of each column (the `df.dtypes` attribute). |
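
A quick sketch of the four attributes in the table, on a hypothetical two-row DataFrame. Note that when columns have different types, the combined NumPy array falls back to the generic `object` dtype.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Age": [19, 22], "Country": ["USA", "India"]},
    index=["s1", "s2"],
)

print(toy.index)     # row labels
print(toy.columns)   # column labels
print(toy.dtypes)    # one dtype per column

# All values as a single 2D NumPy array; mixed column types share dtype object
values = toy.to_numpy()
print(values.shape, values.dtype)
```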

In [6]:
# read_csv already returns a DataFrame, so no extra pd.DataFrame(...) call is needed
df = pd.read_csv(r'..\data\Students Social Media Addiction.csv')
df.head()  # Gives the first 5 rows of the DataFrame

| | Student_ID | Age | Gender | Academic_Level | Country | Avg_Daily_Usage_Hours | Most_Used_Platform | Affects_Academic_Performance | Sleep_Hours_Per_Night | Mental_Health_Score | Relationship_Status | Conflicts_Over_Social_Media | Addicted_Score |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 1 | 19 | Female | Undergraduate | Bangladesh | 5.2 | Instagram | Yes | 6.5 | 6 | In Relationship | 3 | 8 |
| 1 | 2 | 22 | Male | Graduate | India | 2.1 | Twitter | No | 7.5 | 8 | Single | 0 | 3 |
| 2 | 3 | 20 | Female | Undergraduate | USA | 6.0 | TikTok | Yes | 5.0 | 5 | Complicated | 4 | 9 |
| 3 | 4 | 18 | Male | High School | UK | 3.0 | YouTube | No | 7.0 | 7 | Single | 1 | 4 |
| 4 | 5 | 21 | Male | Graduate | Canada | 4.5 | Facebook | Yes | 6.0 | 6 | In Relationship | 2 | 7 |


In [7]:
df.tail() # Gives the last 5 rows of the DataFrame


| | Student_ID | Age | Gender | Academic_Level | Country | Avg_Daily_Usage_Hours | Most_Used_Platform | Affects_Academic_Performance | Sleep_Hours_Per_Night | Mental_Health_Score | Relationship_Status | Conflicts_Over_Social_Media | Addicted_Score |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 700 | 701 | 20 | Female | Undergraduate | Italy | 4.7 | TikTok | No | 7.2 | 7 | In Relationship | 2 | 5 |
| 701 | 702 | 23 | Male | Graduate | Russia | 6.8 | Instagram | Yes | 5.9 | 4 | Single | 5 | 9 |
| 702 | 703 | 21 | Female | Undergraduate | China | 5.6 | WeChat | Yes | 6.7 | 6 | In Relationship | 3 | 7 |
| 703 | 704 | 24 | Male | Graduate | Japan | 4.3 | Twitter | No | 7.5 | 8 | Single | 2 | 4 |
| 704 | 705 | 19 | Female | Undergraduate | Poland | 6.2 | Facebook | Yes | 6.3 | 5 | Single | 4 | 8 |


In [8]:

df.shape # Gives the shape of the DataFrame (rows, columns)

(705, 13)

In [9]:
df.columns # Gives the column names of the DataFrame


Index(['Student_ID', 'Age', 'Gender', 'Academic_Level', 'Country',
       'Avg_Daily_Usage_Hours', 'Most_Used_Platform',
       'Affects_Academic_Performance', 'Sleep_Hours_Per_Night',
       'Mental_Health_Score', 'Relationship_Status',
       'Conflicts_Over_Social_Media', 'Addicted_Score'],
      dtype='object')

In [10]:
df.index # Gives the row index (row labels) of the DataFrame

RangeIndex(start=0, stop=705, step=1)

In [11]:
df.info() # Gives a concise summary of the DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 705 entries, 0 to 704
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Student_ID                    705 non-null    int64  
 1   Age                           705 non-null    int64  
 2   Gender                        705 non-null    object 
 3   Academic_Level                705 non-null    object 
 4   Country                       705 non-null    object 
 5   Avg_Daily_Usage_Hours         705 non-null    float64
 6   Most_Used_Platform            705 non-null    object 
 7   Affects_Academic_Performance  705 non-null    object 
 8   Sleep_Hours_Per_Night         705 non-null    float64
 9   Mental_Health_Score           705 non-null    int64  
 10  Relationship_Status           705 non-null    object 
 11  Conflicts_Over_Social_Media   705 non-null    int64  
 12  Addicted_Score                705 non-null    int64  
dtypes: float64(2), int64(5), object(6)

In [12]:
df.describe() # Gives statistical summary of numerical columns


| | Student_ID | Age | Avg_Daily_Usage_Hours | Sleep_Hours_Per_Night | Mental_Health_Score | Conflicts_Over_Social_Media | Addicted_Score |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| count | 705.0 | 705.0 | 705.0 | 705.0 | 705.0 | 705.0 | 705.0 |
| mean | 353.0 | 20.659574 | 4.918723 | 6.868936 | 6.22695 | 2.849645 | 6.436879 |
| std | 203.660256 | 1.399217 | 1.257395 | 1.126848 | 1.105055 | 0.957968 | 1.587165 |
| min | 1.0 | 18.0 | 1.5 | 3.8 | 4.0 | 0.0 | 2.0 |
| 25% | 177.0 | 19.0 | 4.1 | 6.0 | 5.0 | 2.0 | 5.0 |
| 50% | 353.0 | 21.0 | 4.8 | 6.9 | 6.0 | 3.0 | 7.0 |
| 75% | 529.0 | 22.0 | 5.8 | 7.7 | 7.0 | 4.0 | 8.0 |
| max | 705.0 | 24.0 | 8.5 | 9.6 | 9.0 | 5.0 | 9.0 |
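
By default `describe()` summarizes only the numeric columns. As a sketch on a made-up toy DataFrame (not the student dataset), passing `include="all"` adds text columns to the summary as well.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Age": [19, 22, 20, 18],
        "Gender": ["Female", "Male", "Female", "Male"],
    }
)

# Default behaviour: numeric columns only
numeric_summary = toy.describe()
print(numeric_summary)

# include="all" also summarizes text columns (count, unique, top, freq)
full_summary = toy.describe(include="all")
print(full_summary)
```

Statistics that do not apply to a column, such as the mean of a text column, appear as `NaN` in the combined summary.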


In [13]:
df.dtypes # Gives the data types of each column


Student_ID                        int64
Age                               int64
Gender                           object
Academic_Level                   object
Country                          object
Avg_Daily_Usage_Hours           float64
Most_Used_Platform               object
Affects_Academic_Performance     object
Sleep_Hours_Per_Night           float64
Mental_Health_Score               int64
Relationship_Status              object
Conflicts_Over_Social_Media       int64
Addicted_Score                    int64
dtype: object

In [14]:

# Accessing a single column
df['Age']  # Accessing the 'Age' column


0      19
1      22
2      20
3      18
4      21
       ..
700    20
701    23
702    21
703    24
704    19
Name: Age, Length: 705, dtype: int64

In [15]:
df.Age     # Attribute-style access; works only if the column name is a valid Python identifier and does not clash with a DataFrame attribute


0      19
1      22
2      20
3      18
4      21
       ..
700    20
701    23
702    21
703    24
704    19
Name: Age, Length: 705, dtype: int64
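
Attribute-style access is convenient but has limits. The sketch below uses a hypothetical DataFrame with a column name containing a space and a column named `count`, which collides with the built-in `DataFrame.count` method; bracket access works in every case.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Age": [19, 22], "Sleep Hours": [6.5, 7.5], "count": [3, 0]}
)

# Identical results for a simple column name
same = toy.Age.equals(toy["Age"])
print(same)

# A name with a space cannot be written as an attribute; brackets are required
sleep = toy["Sleep Hours"]

# toy.count is the DataFrame method, not the column
print(callable(toy.count))

# Brackets always return the column
count_column = toy["count"]
print(count_column.tolist())
```

For this reason, bracket notation is the safer default in reusable code.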

In [16]:

# Accessing multiple columns
df[['Country', 'Age']]  # Accessing 'Country' and 'Age' columns

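| | Country | Age |
| :--- | :--- | :--- |
| 0 | Bangladesh | 19 |
| 1 | India | 22 |
| 2 | USA | 20 |
| 3 | UK | 18 |
| 4 | Canada | 21 |
| ... | ... | ... |
| 700 | Italy | 20 |
| 701 | Russia | 23 |
| 702 | China | 21 |
| 703 | Japan | 24 |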
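
The bracket style determines what comes back. As a sketch on a small made-up DataFrame: a single column name returns a Series, while a list of names (double brackets) returns a DataFrame, even when the list has only one entry.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Country": ["USA", "India"],
        "Age": [20, 22],
        "Gender": ["Female", "Male"],
    }
)

one_series = toy["Age"]            # single name: a Series
one_frame = toy[["Age"]]           # list with one name: a one-column DataFrame
subset = toy[["Country", "Age"]]   # list of names: columns in the order given

print(type(one_series).__name__)
print(type(one_frame).__name__)
print(subset.columns.tolist())
```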
