
In [1]:
import numpy as np
import pandas as pd

## **Series**

A **Series** is one of the two primary data structures in pandas (the other being DataFrame). It's a **one-dimensional labeled array** capable of holding data of any type (integers, strings, floating point numbers, Python objects, etc.). Think of it as a single column of data with an associated index.

- One-dimensional: Contains a single column of data
- Homogeneous data type: A Series has a single dtype; when the values have mixed types, they are stored with the generic `object` dtype
- Labeled index: Each element has an associated label (index)
- Size: Most operations return a new Series rather than resizing in place, although assigning to a new label (for example `s['e'] = 50`) does add an element
- Values are mutable: You can modify the values within the Series
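
The properties listed above can be checked directly. The following is a minimal sketch; the variable names `mixed` and `prices` are illustrative and not part of the original notebook. It shows how mixed values fall back to the `object` dtype and how a value can be changed through its label.

```python
import pandas as pd

# Mixed Python types are stored with the generic `object` dtype
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)

# Homogeneous integers get a dedicated integer dtype
prices = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(prices.dtype)

# Values are mutable: assign a new value through the label
prices["b"] = 25
print(prices["b"])   # 25
```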

In [4]:
import pandas as pd
import numpy as np

# 1. From a Python list
s1 = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s1, "\n")

# 2. With custom index
s2 = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s2, "\n")

# 3. From a dictionary
data_dict = {'apple': 45, 'banana': 30, 'orange': 55}
s3 = pd.Series(data_dict)
print(s3, "\n")

# 4. From a scalar value (creates Series with repeated value)
s4 = pd.Series(5, index=['x', 'y', 'z'])
print(s4, "\n")

# 5. From NumPy array
arr = np.array([1, 2, 3, 4])
s5 = pd.Series(arr)
print(s5)

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64 

a    10
b    20
c    30
d    40
dtype: int64 

apple     45
banana    30
orange    55
dtype: int64 

x    5
y    5
z    5
dtype: int64 

0    1
1    2
2    3
3    4
dtype: int64


**Series Attributes**

In [6]:
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# Access the underlying data as a NumPy array
print(s.values, "\n")

# Access the index
print(s.index, "\n")

# Get the data type
print(s.dtype, "\n")

# Get the shape
print(s.shape, "\n")

# Get the size (number of elements)
print(s.size, "\n")

# Check if Series is empty
print(s.empty)

[10 20 30 40] 

Index(['a', 'b', 'c', 'd'], dtype='object') 

int64 

(4,) 

4 

False


**Indexing**

In [10]:
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# Single label
print(s['a'])        # 10

# Multiple labels
print(s[['a', 'c']])

# Using .loc for explicit label-based access
print(s.loc['b'], "\n")    # 20
print(s.loc[['b', 'd']], "\n")

# Using .iloc for position-based access
print(s.iloc[0], "\n")     # 10 (first element)
print(s.iloc[1:3], "\n")   # Elements at positions 1 and 2

# Boolean mask: True where the value is greater than 25
mask = s > 25
print(mask)

10
a    10
c    30
dtype: int64
20 

b    20
d    40
dtype: int64 

10 

b    20
c    30
dtype: int64
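
One detail worth checking when moving between `.loc` and `.iloc` is how slices treat their endpoints. The sketch below reuses the same Series values as the indexing cell; the names `by_label`, `by_position`, and `above_25` are illustrative only.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# Label slice: both endpoints are included
by_label = s.loc["b":"c"]

# Position slice: the stop position is excluded, as in plain Python slicing
by_position = s.iloc[1:3]
print(by_label.equals(by_position))   # True

# A boolean mask can be used as an indexer to keep matching elements only
above_25 = s[s > 25]
print(above_25.index.tolist())        # ['c', 'd']
```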


**Series Operations**

In [None]:
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# Element-wise operations
print(s1 + s2)
print(s2 / s1)

In [None]:
s = pd.Series([1, np.nan, 3, np.nan, 5])

# Check for missing values
print(s.isnull())

# Drop missing values
print(s.dropna())

# Fill missing values
print(s.fillna(0))

**Statistical Methods**

In [None]:
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

print(s.sum())      # 55
print(s.mean())     # 5.5
print(s.median())   # 5.5
print(s.std())      # 3.0276503540974917
print(s.min())      # 1
print(s.max())      # 10
print(s.count())    # 10 (counts non-null values)

**Sorting**

In [None]:
s = pd.Series([3, 1, 4, 1, 5], index=['e', 'a', 'c', 'b', 'd'])

# Sort by values
print(s.sort_values())

# Sort by index
print(s.sort_index())

**String Operations**

In [None]:
s = pd.Series(['apple', 'banana', 'cherry'])

print(s.str.upper())
print(s.str.len())

**Alignment Feature**

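One of pandas' most powerful features is automatic alignment on index labels: values are matched by label rather than by position, and a label present in only one operand produces `NaN` in the result.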

In [11]:
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])

# When performing operations, pandas aligns by index
result = s1 + s2
print(result, "\n")

# To handle missing values in alignment
result_filled = s1.add(s2, fill_value=0)
print(result_filled)
# a    1.0  (1+0)
# b    6.0  (2+4)
# c    8.0  (3+5)
# d    6.0  (0+6)

a    NaN
b    6.0
c    8.0
d    NaN
dtype: float64 

a    1.0
b    6.0
c    8.0
d    6.0
dtype: float64


## **DataFrames**

A **DataFrame** is the primary data structure in pandas and represents a **two-dimensional, size-mutable, potentially heterogeneous tabular data structure** with **labeled axes** (rows and columns). Think of it as a spreadsheet or SQL table, where:

- Rows represent observations or records
- Columns represent variables or features
- Each column can have a different data type (integers, floats, strings, booleans, etc.)
- Both rows and columns have labels (indices)
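
As a rough illustration of the points above, the sketch below builds a small frame with a custom row index and inspects it. The frame `people` and its labels `r1` and `r2` are hypothetical and used only for this example.

```python
import pandas as pd

people = pd.DataFrame(
    {"name": ["Alice", "Bob"], "age": [25, 30], "active": [True, False]},
    index=["r1", "r2"],
)

# Each column is a Series that shares the row index of the frame
age = people["age"]
print(type(age).__name__)               # Series
print(age.index.equals(people.index))   # True

# Columns keep their own dtypes
print(people.dtypes)

# Both axes are labeled: select by row label and column label
print(people.loc["r2", "name"])         # Bob
```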

In [12]:
import pandas as pd
import numpy as np

# 1. From a dictionary of lists/arrays
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Paris', 'Tokyo'],
    'Salary': [50000, 60000, 70000, 55000]
}
df1 = pd.DataFrame(data)
print(df1)

      Name  Age      City  Salary
0    Alice   25  New York   50000
1      Bob   30    London   60000
2  Charlie   35     Paris   70000
3    Diana   28     Tokyo   55000


In [None]:
# 2. From a list of dictionaries
data_list = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'London'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Paris'}
]
df2 = pd.DataFrame(data_list)
print(df2)

In [13]:
# 3. From a NumPy array
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df3 = pd.DataFrame(arr, columns=['A', 'B', 'C'], index=['row1', 'row2', 'row3'])
print(df3)

      A  B  C
row1  1  2  3
row2  4  5  6
row3  7  8  9


In [17]:
# 4. From another DataFrame (copying)
df4 = df1.copy()
print("head \n", df4.head(), "\n")

# 5. From CSV file (common real-world scenario)
# df5 = pd.read_csv('data.csv')

# 6. From Excel file
# df6 = pd.read_excel('data.xlsx')

# 7. Creating empty DataFrame
df_empty = pd.DataFrame()
print(df_empty, "\n")

head 
       Name  Age      City  Salary
0    Alice   25  New York   50000
1      Bob   30    London   60000
2  Charlie   35     Paris   70000
3    Diana   28     Tokyo   55000 

Empty DataFrame
Columns: []
Index: [] 



**Specifying Index and Columns**

In [18]:
# Custom index and columns
data = [[1, 2, 3], [4, 5, 6]]
df = pd.DataFrame(
    data,
    index=['first', 'second'],
    columns=['col_A', 'col_B', 'col_C']
)
print(df)

        col_A  col_B  col_C
first       1      2      3
second      4      5      6


**DataFrame Attributes**

In [None]:
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4.0, 5.0, 6.0],
    'C': ['x', 'y', 'z']
})

# Shape (rows, columns)
print(f"Shape: {df.shape}")  # Shape: (3, 3)

# Number of dimensions
print(f"Dimensions: {df.ndim}")  # Dimensions: 2

# Total number of elements
print(f"Size: {df.size}")  # Size: 9

# Column names
print(f"Columns: {df.columns.tolist()}")  # Columns: ['A', 'B', 'C']

# Row index
print(f"Index: {df.index.tolist()}")  # Index: [0, 1, 2]

# Data types of each column
print(f"Data types:\n{df.dtypes}")

# Check if DataFrame is empty
print(f"Empty: {df.empty}")  # Empty: False

# Underlying NumPy array (dtype is object here because the columns have mixed types)
print(f"Values:\n{df.values}")

**Column Selection**

In [19]:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000]
})

# Single column (returns Series)
print(df['Name'])
# Output: Series with Name column

# Multiple columns (returns DataFrame)
print(df[['Name', 'Age']])

# Using dot notation (only for valid Python identifiers)
print(df.Name)  # Same as df['Name']

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object


**Row Selection**

In [21]:
# Using .loc (label-based)
print(df.loc[0], "\n")  # First row as Series
print(df.loc[[0, 2]], "\n")  # First and third rows as DataFrame

# Using .iloc (position-based)
print(df.iloc[0], "\n")  # First row
print(df.iloc[0:2], "\n")  # First two rows (exclusive of end)

# Boolean indexing
print(df[df['Age'] > 25])  # Rows where Age > 25

Name      Alice
Age          25
Salary    50000
Name: 0, dtype: object 

      Name  Age  Salary
0    Alice   25   50000
2  Charlie   35   70000 

Name      Alice
Age          25
Salary    50000
Name: 0, dtype: object 

    Name  Age  Salary
0  Alice   25   50000
1    Bob   30   60000 

      Name  Age  Salary
1      Bob   30   60000
2  Charlie   35   70000


In [24]:
# Select specific rows and columns
print(df.loc[0:1, ['Name', 'Salary']])  # Rows labeled 0 through 1 (label slices include the end), columns Name and Salary
print(df.iloc[0:2, [0, 2]])  # First 2 rows (position slices exclude the end), first and third columns

# Using query method (string expression)
print(df.query('Age > 25 and Salary < 70000'))

    Name  Salary
0  Alice   50000
1    Bob   60000
    Name  Salary
0  Alice   50000
1    Bob   60000
  Name  Age  Salary
1  Bob   30   60000


**Basic Operations**

In [28]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Add new column
df['C'] = [7, 8, 9]
df['D'] = df['A'] + df['B']  # Calculated column
print("df with D \n", df)

# Remove columns
df_dropped = df.drop('C', axis=1)  # axis=1 for columns
df.drop('C', axis=1, inplace=True)  # Modify original
print(df)

# Add column using assign (returns new DataFrame)
df_new = df.assign(E=df['A'] * 2)
print("\n df new", df_new)

# Rename columns
df_renamed = df.rename(columns={'A': 'NewA', 'B': 'NewB'})
print("\n df renamed", df_renamed)

df with D 
    A  B  C  D
0  1  4  7  5
1  2  5  8  7
2  3  6  9  9
   A  B  D
0  1  4  5
1  2  5  7
2  3  6  9

 df new    A  B  D  E
0  1  4  5  2
1  2  5  7  4
2  3  6  9  6

 df renamed    NewA  NewB  D
0     1     4  5
1     2     5  7
2     3     6  9


In [30]:
# Add row using concat (preferred method)
# Note: 'E' is not a column of df, so the existing rows get NaN in that column
new_row = pd.DataFrame({'A': [4], 'B': [7], 'D': [11], 'E': [8]})
df_updated = pd.concat([df, new_row], ignore_index=True)
print(df_updated)

# Remove rows
df_no_first = df.drop(0)  # Remove row with index 0
print(df_no_first)
df_filtered = df[df['A'] != 2]  # Remove rows where A equals 2
print(df_filtered)

   A  B   D    E
0  1  4   5  NaN
1  2  5   7  NaN
2  3  6   9  NaN
3  4  7  11  8.0
   A  B  D
1  2  5  7
2  3  6  9
   A  B  D
0  1  4  5
2  3  6  9


**Data Inspection and Summary**

In [35]:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'Salary': [50000, 60000, 70000, 55000],
    'Department': ['IT', 'HR', 'IT', 'Finance']
})

# First few rows
print(df.head(2))

# Last few rows
print(df.tail(2))

# Summary statistics (numeric columns only)
print("\n describe \n", df.describe())

# Information about DataFrame (info() prints its report itself and returns None)
print("\n info")
df.info()

# Check for missing values
print("\n missing values \n", df.isnull().sum())

    Name  Age  Salary Department
0  Alice   25   50000         IT
1    Bob   30   60000         HR
      Name  Age  Salary Department
2  Charlie   35   70000         IT
3    Diana   28   55000    Finance

 describe 
              Age        Salary
count   4.000000      4.000000
mean   29.500000  58750.000000
std     4.203173   8539.125638
min    25.000000  50000.000000
25%    27.250000  53750.000000
50%    29.000000  57500.000000
75%    31.250000  62500.000000
max    35.000000  70000.000000

 info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        4 non-null      object
 1   Age         4 non-null      int64 
 2   Salary      4 non-null      int64 
 3   Department  4 non-null      object
dtypes: int64(2), object(2)
memory usage: 260.0+ bytes

 missing values 
 Name          0
Age           0
Salary        0
Department    0
dtype: int64

In [36]:
# Count unique values in a column
print(df['Department'].value_counts())

# Get unique values
print(df['Department'].unique())

# Check if values are in a list
print(df['Department'].isin(['IT', 'HR']))

Department
IT         2
HR         1
Finance    1
Name: count, dtype: int64
['IT' 'HR' 'Finance']
0     True
1     True
2     True
3    False
Name: Department, dtype: bool


In [None]:
df_with_nan = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})

# Detect missing values
print(df_with_nan.isnull())

# Drop rows with any missing values
df_dropped = df_with_nan.dropna()

# Drop columns with any missing values
df_dropped_cols = df_with_nan.dropna(axis=1)

# Fill missing values
df_filled = df_with_nan.fillna(0)
df_filled_mean = df_with_nan.fillna(df_with_nan.mean())  # Fill with column mean

# Forward fill or backward fill
# (fillna(method=...) is deprecated in recent pandas; use ffill() and bfill() instead)
df_ffill = df_with_nan.ffill()
df_bfill = df_with_nan.bfill()

In [40]:
df = pd.DataFrame({
    'Name': ['Charlie', 'Alice', 'Bob'],
    'Age': [35, 25, 30],
    'Salary': [70000, 50000, 60000]
})

# Sort by single column
df_sorted = df.sort_values('Age')
print(df_sorted)

# Sort by multiple columns
df_sorted_multi = df.sort_values(['Age', 'Name'], ascending=[False, False])
print(df_sorted_multi)

# Sort by index
df_sorted_index = df.sort_index()
print(df_sorted_index)

      Name  Age  Salary
1    Alice   25   50000
2      Bob   30   60000
0  Charlie   35   70000
      Name  Age  Salary
0  Charlie   35   70000
2      Bob   30   60000
1    Alice   25   50000
      Name  Age  Salary
0  Charlie   35   70000
1    Alice   25   50000
2      Bob   30   60000


In [41]:
df = pd.DataFrame({
    'Department': ['IT', 'HR', 'IT', 'Finance', 'HR', 'IT'],
    'Salary': [50000, 45000, 70000, 60000, 55000, 80000],
    'Age': [25, 30, 35, 28, 32, 40]
})

# Group by department and calculate mean
grouped = df.groupby('Department')
print(grouped.mean())

# Multiple aggregations
print(grouped.agg({
    'Salary': ['mean', 'sum', 'count'],
    'Age': ['min', 'max']
}))

# Group by multiple columns
df['Experience'] = ['Junior', 'Senior', 'Senior', 'Mid', 'Senior', 'Senior']
multi_group = df.groupby(['Department', 'Experience'])
print(multi_group.mean())

                  Salary        Age
Department                         
Finance     60000.000000  28.000000
HR          50000.000000  31.000000
IT          66666.666667  33.333333
                  Salary               Age    
                    mean     sum count min max
Department                                    
Finance     60000.000000   60000     1  28  28
HR          50000.000000  100000     2  30  32
IT          66666.666667  200000     3  25  40
                        Salary   Age
Department Experience               
Finance    Mid         60000.0  28.0
HR         Senior      50000.0  31.0
IT         Junior      50000.0  25.0
           Senior      75000.0  37.5


In [44]:
# Sample DataFrames
df1 = pd.DataFrame({
    'employee_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'department_id': [101, 102, 101, 103]
})

df2 = pd.DataFrame({
    'department_id': [101, 102, 103, 104],
    'department_name': ['IT', 'HR', 'Finance', 'Marketing']
})

# Inner join (default)
merged_inner = pd.merge(df1, df2, on='department_id')
print(merged_inner)

# Left join
merged_left = pd.merge(df1, df2, on='department_id', how='left')
print(merged_left)

# Right join
merged_right = pd.merge(df1, df2, on='department_id', how='right')
print(merged_right)

# Outer join
merged_outer = pd.merge(df1, df2, on='department_id', how='outer')
print(merged_outer)

   employee_id     name  department_id department_name
0            1    Alice            101              IT
1            2      Bob            102              HR
2            3  Charlie            101              IT
3            4    Diana            103         Finance
   employee_id     name  department_id department_name
0            1    Alice            101              IT
1            2      Bob            102              HR
2            3  Charlie            101              IT
3            4    Diana            103         Finance
   employee_id     name  department_id department_name
0          1.0    Alice            101              IT
1          3.0  Charlie            101              IT
2          2.0      Bob            102              HR
3          4.0    Diana            103         Finance
4          NaN      NaN            104       Marketing
   employee_id     name  department_id department_name
0          1.0    Alice            101              IT
1         

In [45]:
df_sales = pd.DataFrame({
    'Region': ['North', 'South', 'North', 'South', 'North', 'South'],
    'Product': ['A', 'A', 'B', 'B', 'A', 'B'],
    'Sales': [100, 150, 200, 250, 120, 180]
})

# Create pivot table
pivot = df_sales.pivot_table(
    values='Sales',
    index='Region',
    columns='Product',
    aggfunc='sum'
)
print(pivot)

Product    A    B
Region           
North    220  200
South    150  430


| Function         | Purpose                                                                    |
| ---------------- | -------------------------------------------------------------------------- |
| `applymap(func)` | Apply a function **element-wise** on the entire DataFrame (deprecated since pandas 2.1 in favor of `DataFrame.map`) |
| `apply(func)`    | Apply a function **column-wise** (or row-wise with `axis=1`)               |
| `agg(func_dict)` | Apply **different functions to different columns**                         |
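
To make the difference between the rows of this table concrete, here is a small sketch. The frame `df_demo` and the result names are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

df_demo = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Column-wise (default, axis=0): the function receives one column at a time
col_range = df_demo.apply(lambda col: col.max() - col.min())
print(col_range.tolist())   # [2, 2]

# Row-wise (axis=1): the function receives one row at a time
row_sum = df_demo.apply(lambda row: row["A"] + row["B"], axis=1)
print(row_sum.tolist())     # [5, 7, 9]

# agg with a dict: a different reduction for each column
summary = df_demo.agg({"A": "sum", "B": "mean"})
print(summary)              # sum of A is 6, mean of B is 5
```

With `axis=1` the function is called once per row, which is usually slower than a vectorized column expression such as `df_demo["A"] + df_demo["B"]`, so it is best kept for logic that cannot be expressed column-wise.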


In [None]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

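# Apply function to each element
# (applymap is deprecated since pandas 2.1; DataFrame.map is its replacement)
df_squared = df.map(lambda x: x**2) if hasattr(df, 'map') else df.applymap(lambda x: x**2)
# Column-wise alternative: each column is passed to the function as a Series
df_squared = df.apply(lambda x: x**2)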

# Apply different functions to different columns
df_transformed = df.agg({
    'A': lambda x: x * 2,
    'B': lambda x: x + 10
})

In [None]:
# Create DataFrame with datetime index
dates = pd.date_range('2023-01-01', periods=6, freq='D')
df_time = pd.DataFrame({
    'Sales': [100, 120, 110, 130, 140, 150],
    'Expenses': [80, 90, 85, 95, 100, 105]
}, index=dates)

# Resample to weekly frequency
weekly = df_time.resample('W').sum()

# Rolling window operations
df_time['Rolling_Mean'] = df_time['Sales'].rolling(window=3).mean()

In [46]:
# Real-world example: Sales Analysis
import pandas as pd
import numpy as np

# Create sample sales data
np.random.seed(42)
dates = pd.date_range('2023-01-01', periods=100, freq='D')
products = np.random.choice(['Laptop', 'Phone', 'Tablet'], 100)
regions = np.random.choice(['North', 'South', 'East', 'West'], 100)
sales = np.random.randint(1000, 5000, 100)
profit_margin = np.random.uniform(0.1, 0.3, 100)

df = pd.DataFrame({
    'Date': dates,
    'Product': products,
    'Region': regions,
    'Sales': sales,
    'Profit_Margin': profit_margin
})

# Set Date as index
df.set_index('Date', inplace=True)

# Add calculated columns
df['Profit'] = df['Sales'] * df['Profit_Margin']

# Summary statistics
print("Overall Summary:")
print(df.describe())

# Group by product and region
summary = df.groupby(['Product', 'Region']).agg({
    'Sales': ['sum', 'mean', 'count'],
    'Profit': 'sum'
}).round(2)

print("\nSales by Product and Region:")
print(summary)

# Find top performing products
top_products = df.groupby('Product')['Sales'].sum().sort_values(ascending=False)
print("\nTop Products by Sales:")
print(top_products)

Overall Summary:
             Sales  Profit_Margin       Profit
count   100.000000     100.000000   100.000000
mean   3092.100000       0.207520   650.862658
std    1148.738126       0.058624   329.404733
min    1001.000000       0.101012   115.668465
25%    2231.000000       0.155376   390.551522
50%    2942.500000       0.214697   552.476850
75%    4278.000000       0.259690   920.467376
max    4908.000000       0.298011  1358.789861

Sales by Product and Region:
                Sales                  Profit
                  sum     mean count      sum
Product Region                               
Laptop  East    30554  3394.89     9  7328.11
        North   18904  3150.67     6  3456.90
        South   30236  3359.56     9  6773.42
        West    30888  3432.00     9  6589.11
Phone   East    34211  2631.62    13  6687.93
        North   19630  2453.75     8  4381.45
        South   16954  2422.00     7  3335.63
        West    21945  2743.12     8  3978.08
Tablet  East    44823  3