# 1. DataFrames

A DataFrame is the central data structure of the pandas library. It is a 2-dimensional structure that organizes data into rows and columns, like a spreadsheet or an SQL table.


1. Tabular Format: Each column is a Series object and each row is a record.

2. Labeled Axes: Rows and columns can be given labels, which are used to access and manipulate the data.

3. Flexible Data Types: Each column can have a different data type, such as integers, floats, or strings.

4. Indexing: DataFrames support indexing and slicing, which helps in selecting specific rows and columns.

5. Data Operations: DataFrames can be filtered, sorted, aggregated, grouped, and merged.
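
To make points 1 and 3 concrete, here is a minimal sketch on a made-up three-column frame (the column names and values are hypothetical, chosen only for this illustration). It checks that a selected column is a Series and that each column carries its own dtype.

```python
import pandas as pd

# Hypothetical data, used only for illustration
example_df = pd.DataFrame({
    "name": ["Alice", "Bob"],   # strings
    "age": [25, 30],            # integers
    "score": [88.5, 92.0],      # floats
})

# Selecting a single column returns a Series
age_column = example_df["age"]
print(type(age_column).__name__)

# Each column keeps its own data type
print(example_df.dtypes)
```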

## (a) Create a DF

In [1]:
import pandas as pd

# 1. Make a df from a dict
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'George', 'Trump'],
    'Age': [25, 30, 35, 40, 89],
    'City': ['New York', 'Los Angeles', 'Chicago', 'San Jose', 'Dallas']
}

dict2df = pd.DataFrame(data)
print(dict2df)

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3   George   40     San Jose
4    Trump   89       Dallas




In [2]:
# 2. Make a df from a list of rows

data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago']
]

list2df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(list2df)

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


In [3]:
# 3. Read a csv file into a df

# csv2df = pd.read_csv('filepath')
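
Because no CSV file accompanies these notes, the sketch below reads CSV text from an in-memory buffer instead of a path. The CSV content is made up for illustration; `pd.read_csv` takes a file path in the same argument position.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a file on disk
csv_text = "Name,Age,City\nAlice,25,New York\nBob,30,Los Angeles\n"

# read_csv accepts a path or any file-like object
csv2df = pd.read_csv(io.StringIO(csv_text))
print(csv2df)
```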

## (b) Basic operations with DF

1. **View top and tail** 
- df.head()
- df.tail()

2. **Access column**
- df['column_name']

3. **Access row**
- df.loc[row_label]

4. **Filter rows based on a column value**
- df[df['Age'] > 20]
- Here, df['Age'] > 20 produces a boolean Series holding True/False for each row, based on the given condition.
- Then df[condition] returns only the rows where the condition is True.

5. **Add and drop a column**
- df['column_name'] = 'column_value'  # add a column
- df.drop(columns=['column_name'])  # drop a column; returns a new df

6. **Summary statistics**
- df.describe()




In [4]:
# 1. View the head and tail data
print(dict2df.head())  # Top 5 (default) rows
print(dict2df.tail()) # Bottom 5 rows

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3   George   40     San Jose
4    Trump   89       Dallas
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3   George   40     San Jose
4    Trump   89       Dallas


In [5]:
# 2. Access data
print(dict2df['Age'])   # Access age column

0    25
1    30
2    35
3    40
4    89
Name: Age, dtype: int64


In [6]:
# 3. Access a row
print(dict2df.loc[3])  # Access the row with index label 3 (the fourth row)

Name      George
Age           40
City    San Jose
Name: 3, dtype: object
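
`.loc` selects by index label, which only coincides with row position when the index is the default 0, 1, 2, and so on. The sketch below uses a hypothetical frame with string labels to contrast `.loc` with the position-based `.iloc`.

```python
import pandas as pd

# Hypothetical frame with string labels as the index
people = pd.DataFrame(
    {"Age": [25, 30, 35]},
    index=["alice", "bob", "charlie"],
)

by_label = people.loc["bob"]    # row whose index label is "bob"
by_position = people.iloc[1]    # second row, counting from 0

print(by_label["Age"], by_position["Age"])
```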


In [7]:
# 4. Filter a row
dict2df[dict2df['Age']>30]

      Name  Age      City
2  Charlie   35   Chicago
3   George   40  San Jose
4    Trump   89    Dallas
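
Several conditions can be combined with `&` (and) and `|` (or). A minimal sketch on made-up data follows; each comparison needs its own parentheses because `&` and `|` bind more tightly than `>` or `==`.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "George"],
    "Age": [25, 30, 35, 40],
    "City": ["New York", "Los Angeles", "Chicago", "San Jose"],
})

# Each comparison yields a boolean Series
older_than_28 = people["Age"] > 28
not_chicago = people["City"] != "Chicago"

# AND: both conditions must hold
both = people[older_than_28 & not_chicago]

# OR: at least one condition must hold
either = people[(people["Age"] < 30) | (people["City"] == "Chicago")]

print(both)
print(either)
```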


In [8]:
# 5. Add a column
dict2df['Country'] = 'USA'
print(dict2df)

      Name  Age         City Country
0    Alice   25     New York     USA
1      Bob   30  Los Angeles     USA
2  Charlie   35      Chicago     USA
3   George   40     San Jose     USA
4    Trump   89       Dallas     USA


In [9]:
# 5. Remove a column
dict2df.drop(columns='Country')  # Returns a new df; dict2df itself is unchanged

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3   George   40     San Jose
4    Trump   89       Dallas
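
Note that `drop` does not modify the DataFrame it is called on, which is why `dict2df.info()` and `dict2df.shape` in the following cells still report four columns. A small sketch on a made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({"Name": ["Alice", "Bob"], "Country": ["USA", "USA"]})

# drop() returns a new DataFrame and leaves the original untouched
without_country = frame.drop(columns="Country")
columns_after_drop = list(frame.columns)
print(columns_after_drop)
print(list(without_country.columns))

# Reassign the result to make the removal stick
frame = frame.drop(columns="Country")
print(list(frame.columns))
```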


In [10]:
# 6. Summary statistics
dict2df.describe()

             Age
count   5.000000
mean   43.800000
std    25.878563
min    25.000000
25%    30.000000
50%    35.000000
75%    40.000000
max    89.000000


In [11]:
dict2df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Name     5 non-null      object
 1   Age      5 non-null      int64 
 2   City     5 non-null      object
 3   Country  5 non-null      object
dtypes: int64(1), object(3)
memory usage: 292.0+ bytes


In [12]:
# 7. Shape of df
assert dict2df.shape == (5,4) # Returns size of df in a tuple
assert dict2df.shape[0] == 5 # number of rows
assert dict2df.shape[1] == 4 # number of columns

# 2. Missing values

- When we read a file such as csv or excel with pandas, it creates a df (see section 1, DataFrames).

- Empty cells are converted to NaN in the df.

- We can also specify additional values to be treated as NaN, for example with the `na_values` parameter of `pd.read_csv`.

- We can then count the NaN cells, select the columns or rows that contain NaN, and so on.

- Functions used are: isnull(), sum().

- df.notnull() finds non-missing values.
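
Custom missing-value markers can be declared while reading a file. The sketch below assumes the `na_values` parameter of `pd.read_csv` and uses made-up CSV text in which `unknown` and `-` stand for missing entries.

```python
import io
import pandas as pd

# Made-up CSV: one empty cell plus two placeholder strings for "missing"
csv_text = "Name,Age,City\nAlice,25,New York\nBob,,unknown\nCharlie,-,Chicago\n"

# Empty cells become NaN automatically; na_values adds custom markers
df = pd.read_csv(io.StringIO(csv_text), na_values=["unknown", "-"])

print(df.isnull().sum())   # NaN count per column
print(df.notnull().sum())  # non-missing count per column
```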

## Check for Missing values

In [13]:
# Example

# Given data
missing_value_dict = {
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Age': [25, None, 35, 40],
    'City': ['New York', 'Los Angeles', None, 'Chicago']
}

# Create df from dict
missing_value_df = pd.DataFrame(missing_value_dict)

print(missing_value_df)

      Name   Age         City
0    Alice  25.0     New York
1      Bob   NaN  Los Angeles
2  Charlie  35.0         None
3     None  40.0      Chicago


In [14]:
# Build a boolean table: True where a value is missing
true_false_table = missing_value_df.isnull()
print(true_false_table)


    Name    Age   City
0  False  False  False
1  False   True  False
2  False  False   True
3   True  False  False


In [15]:
# Check, column by column, whether at least one value is missing
true_false_table.any() 

Name    True
Age     True
City    True
dtype: bool

In [16]:
# Calculate the number of NaN values in each column
null_count = true_false_table.sum()
print(null_count)

Name    1
Age     1
City    1
dtype: int64
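
The same boolean table can also be used to pick out the rows or columns that contain NaN. A minimal sketch that rebuilds the example data so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None],
    "Age": [25, None, 35, 40],
    "City": ["New York", "Los Angeles", None, "Chicago"],
})

mask = df.isnull()

rows_with_nan = df[mask.any(axis=1)]          # rows with at least one NaN
cols_with_nan = df.columns[mask.any(axis=0)]  # labels of columns containing NaN
total_nan = int(mask.sum().sum())             # NaN cells in the whole frame

print(rows_with_nan)
print(list(cols_with_nan), total_nan)
```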


## Handling missing data

### 1. Approaches to Handle Missing Data

1. Deleting Missing Data:

**Listwise Deletion**: Remove entire rows with any missing values.

> - Pros: Simple and easy to implement.
> - Cons: Loss of data, which can lead to bias if the missing data is not MCAR (Missing Completely at Random).

**Pairwise Deletion**: For each analysis, remove only the rows that are missing values in the variables that analysis uses.

> - Pros: Retains more data for analyses where the missingness doesn't affect all variables.
> - Cons: Can lead to inconsistencies in analysis results.

2. Imputation Methods:

**Simple Imputation:**
- Mean/Median Imputation: Replace missing values with the mean or median of the column.
> * Pros: Easy to implement.
> * Cons: Can distort variance and relationships.

- Mode Imputation: Replace missing values with the mode, useful for categorical data.
> * Pros: Maintains the distribution for categorical variables.
> * Cons: Can over-represent the mode.

**Predictive Imputation:**
- Regression Imputation: Use regression models to predict missing values based on other variables.
>- Pros: More accurate as it uses existing relationships.
>- Cons: Can lead to overfitting if not done carefully.

- K-Nearest Neighbors (KNN) Imputation: Use similar data points (neighbors) to impute missing values.
>- Pros: Captures local relationships well.
>- Cons: Computationally expensive for large datasets.

**Advanced Imputation:**
- Multiple Imputation: Generate several imputed datasets and combine results.
>- Pros: Accounts for uncertainty and provides more robust estimates.
>- Cons: More complex to implement.

- Machine Learning Models: Use models like random forests for imputation.
>- Pros: Can capture complex patterns.
>- Cons: Requires careful tuning and validation.


3. Model-Based Approaches:

**Maximum Likelihood Estimation**: Incorporate missing data directly into model estimation.
>- Pros: Theoretically optimal and uses all available data.
>- Cons: Complex to implement for non-standard models.

4. Using Indicators:

- Missingness Indicators: Add binary indicators to show where data is missing.
>- Pros: Allows modeling of missingness itself as a feature.
>- Cons: Increases complexity of the model.
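
As a rough illustration of simple imputation and missingness indicators, the sketch below works on made-up data. The column names and the choice of median for `Age` are assumptions made for this example, not a recommendation.

```python
import pandas as pd

# Made-up data with one numeric and one categorical gap
df = pd.DataFrame({
    "Age": [25.0, None, 35.0, 40.0],
    "City": ["Chicago", "Chicago", None, "Dallas"],
})

# Missingness indicator: record where Age was missing before filling it
df["Age_missing"] = df["Age"].isnull().astype(int)

# Median imputation for a numeric column
df["Age"] = df["Age"].fillna(df["Age"].median())

# Mode imputation for a categorical column (mode() returns a Series)
df["City"] = df["City"].fillna(df["City"].mode()[0])

print(df)
```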

### 2. Decision Criteria for Selecting an Approach

1. Nature of Missingness:

- MCAR (Missing Completely at Random): Whether a value is missing is unrelated to any other data, observed or unobserved. Simple methods can be sufficient.
- MAR (Missing at Random): Whether a value is missing depends only on other observed variables. Predictive imputation is often needed.
- MNAR (Missing Not at Random): Whether a value is missing depends on unobserved data, such as the missing value itself. Requires careful analysis and possibly model-based approaches.

2. Data Characteristics:

- Size of the Dataset: Large datasets may tolerate simple methods due to less impact from missingness.
- Data Type: Categorical vs. continuous data may require different imputation strategies.

3. Analysis Goals:

- Exploratory Analysis: Simpler methods may be acceptable.
- Predictive Modeling: Preserving relationships and variance is crucial, often necessitating more sophisticated imputation.

4. Percentage of Missing Data:

- Low Percentage: Simple imputation (mean, median, mode) may be adequate.
- High Percentage: Consider more robust methods like multiple imputation or deletion strategies if the data cannot be reliably imputed.

5. Importance of Variables:

- Critical Variables: More effort and sophisticated methods should be used for important variables.

6. Computational Resources:

- Complexity vs. Simplicity: Weigh the benefits of sophisticated imputation against computational cost and complexity.

7. Validation and Experimentation:

- Cross-Validation: Validate the impact of different methods on model performance.
- Sensitivity Analysis: Assess how sensitive your results are to the imputation method chosen.  

By considering these approaches and decision criteria, you can choose a suitable method for handling missing data tailored to your dataset and analysis goals.
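
One input to criterion 4 is the share of missing values per column. The sketch below computes it on made-up data; what counts as a low or high percentage is left open, since no threshold is given here.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None],
    "Age": [25, None, None, 40],
    "City": ["New York", "Los Angeles", "Chicago", "Dallas"],
})

# isnull() gives booleans; their mean is the fraction of missing cells
missing_pct = df.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False))
```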










### (a) Remove missing values

In [17]:
# 1. Drop every row that has at least one NaN

print(missing_value_df.dropna())

    Name   Age      City
0  Alice  25.0  New York


In [18]:
# 2. Drop rows that have NaN in a specific column

missing_value_df.dropna(subset=['Age'])

      Name   Age      City
0    Alice  25.0  New York
2  Charlie  35.0      None
3     None  40.0   Chicago
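
`dropna(subset=...)` is also the building block for the pairwise deletion described earlier: each analysis drops rows only for the columns it uses. A minimal sketch on made-up measurements (the column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "height": [150.0, 160.0, None, 180.0, 190.0],
    "weight": [50.0, None, 70.0, 80.0, 90.0],
    "age": [20.0, 30.0, 40.0, None, 60.0],
})

# Listwise: keep only rows that are complete in every column
listwise = df.dropna()

# Pairwise: for one analysis, drop rows missing in just the columns it uses
height_weight = df.dropna(subset=["height", "weight"])
height_age = df.dropna(subset=["height", "age"])

print(len(listwise), len(height_weight), len(height_age))
```

Pairwise deletion keeps more rows per analysis, but each analysis then runs on a different subset of rows, which is the source of the inconsistencies noted above.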


In [19]:
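# 3. Drop every column that has at least one NaN

missing_value_df.dropna(axis=1)  # axis=0 drops rows (default), axis=1 drops columns
                                 # Every column here contains a NaN, so none remain

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]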


In [20]:
# 4. Keep only rows with enough non-NaN values

missing_value_df.dropna(thresh=2)  # Keep rows with at least 2 non-NaN values
                                   # Rows with fewer than 2 non-NaN values are dropped

      Name   Age         City
0    Alice  25.0     New York
1      Bob   NaN  Los Angeles
2  Charlie  35.0         None
3     None  40.0      Chicago


### (b) Filling missing values

In [21]:
# 1. Fill every NaN with one specific value
missing_value_df.fillna(1000)

      Name     Age         City
0    Alice    25.0     New York
1      Bob  1000.0  Los Angeles
2  Charlie    35.0         1000
3     1000    40.0      Chicago


In [22]:
# 2. Forward fill: propagate the previous valid value down each column
print(missing_value_df)
missing_value_df.ffill()  # Replaces fillna(method='ffill'), which is deprecated in recent pandas

      Name   Age         City
0    Alice  25.0     New York
1      Bob   NaN  Los Angeles
2  Charlie  35.0         None
3     None  40.0      Chicago


      Name   Age         City
0    Alice  25.0     New York
1      Bob  25.0  Los Angeles
2  Charlie  35.0  Los Angeles
3  Charlie  40.0      Chicago


In [23]:
# 3. Backward fill: propagate the next valid value up each column
missing_value_df.bfill()  # Replaces fillna(method='bfill'), which is deprecated in recent pandas

      Name   Age         City
0    Alice  25.0     New York
1      Bob  35.0  Los Angeles
2  Charlie  35.0      Chicago
3     None  40.0      Chicago


In [24]:
# 4. Specify a different filler for each column
missing_value_df.fillna({'Name': 'anonymous', 'Age': 0, 'City': 'NEW York'})

        Name   Age         City
0      Alice  25.0     New York
1        Bob   0.0  Los Angeles
2    Charlie  35.0     NEW York
3  anonymous  40.0      Chicago


### (c) Interpolation

In [25]:
missing_value_df.interpolate()  # Linear by default: fills numeric gaps from neighboring values
                                # Non-numeric columns (Name, City) are left unchanged

      Name   Age         City
0    Alice  25.0     New York
1      Bob  30.0  Los Angeles
2  Charlie  35.0         None
3     None  40.0      Chicago
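
With the default linear method, a gap is filled with evenly spaced values between its neighbors, which is why the missing age between 25 and 35 became 30. A small sketch with a two-value gap, on made-up readings:

```python
import pandas as pd

readings = pd.Series([10.0, None, None, 40.0])

# Default method is linear: the gap is filled in equal steps
filled = readings.interpolate()
print(filled.tolist())
```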


### (d) Using imputation libraries

`sklearn.impute.SimpleImputer` provides simple imputation strategies: mean, median, most frequent value, or a constant.

In [26]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
missing_value_df[['Age']] = imputer.fit_transform(missing_value_df[['Age']])
missing_value_df


      Name        Age         City
0    Alice  25.000000     New York
1      Bob  33.333333  Los Angeles
2  Charlie  35.000000         None
3     None  40.000000      Chicago


# 3. Read files
- Read an xlsx file that has multiple sheets.

```python
rawdata = pd.read_excel('Raw_data.xlsx')  # Reads only the first sheet by default
```

- To read a different sheet, pass its name as sheet_name:
```python
rawdata = pd.read_excel('Raw_data.xlsx', sheet_name='sheet_name')  # Reads the sheet named 'sheet_name'
```

- Passing sheet_name=None reads all sheets and returns a dict of DataFrames keyed by sheet name.


# 4. Pandas Series vs Python List

>- A pandas Series is like a Python list, but with a customizable index.
>- A Python list only has 0-based positional indexing.

>- A pandas Series stores all of its elements under a single dtype.
>- A Python list can hold elements of different types.


In [4]:
# Pandas series
import pandas as pd
s: pd.Series = pd.Series([1,2,3,4], index=['a','b','c','d'])
print(s)

a    1
b    2
c    3
d    4
dtype: int64


In [9]:
# Python list
from typing import List
l: List = [1,2,3,4]
print(l)

[1, 2, 3, 4]
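
The differences show up as soon as the two are indexed or used in arithmetic. A minimal sketch (the variable names are chosen only for this illustration):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
numbers = [1, 2, 3, 4]

print(s["b"])      # access by custom label
print(numbers[1])  # access by 0-based position

doubled_series = s * 2       # elementwise arithmetic
repeated_list = numbers * 2  # list repetition, not arithmetic

# Mixed element types force the Series to the generic object dtype
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)
```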


# 5. datetime module

- The datetime module provides classes for working with dates, times, and combinations of the two.

**Main classes in 'datetime'**

- **date**: Represents a date (year, month, and day).
- **time**: Represents a time (hour, minute, second, and microsecond).
- **datetime**: Combines date and time.
- **timedelta**: Represents the difference between two dates or times.
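
The four classes fit together as follows: a `date` and a `time` can be combined into a `datetime`, and subtracting two `datetime` objects gives a `timedelta`. A short sketch with fixed, made-up values:

```python
import datetime

d = datetime.date(2024, 8, 7)
t = datetime.time(12, 30)

# datetime.combine joins a date and a time into one datetime
dt = datetime.datetime.combine(d, t)

# Subtracting two datetimes yields a timedelta
gap = dt - datetime.datetime(2024, 8, 1, 12, 30)

print(dt)
print(gap)
```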

## (a) Get current date/time/datetime

In [11]:
# Import the datetime module
import datetime

# Get today's date
today = datetime.date.today()
print(today)  # Current date

2024-08-07

In [24]:
# Get the current date and time
today_datetime = datetime.datetime.now()
print(today_datetime)

2024-08-07 12:27:53.820394


In [29]:
# Get current time
current_time = today_datetime.time()
print(current_time)

12:27:53.820394


## (b) Create specific date/time/datetime

In [33]:
# Create a specific date
specific_date = datetime.date(2024,8,11)
print(specific_date)

# Create a specific time object
specific_time = datetime.time(12,15,12)
print(specific_time)

# Create a specific datetime
specific_datetime = datetime.datetime(2024,8,11, 12,15,12)
print(specific_datetime)

2024-08-11
12:15:12
2024-08-11 12:15:12


## (c) timedelta

In [39]:
# Create a timedelta object
delta = datetime.timedelta(days=5, hours=3, minutes=30)
print(delta)

# Add a timedelta to a date or datetime
future_date = datetime.datetime.now() + delta
print(future_date)

# Time remaining until a target datetime
time_left = datetime.datetime(2024, 8, 11, 0, 0, 0) - datetime.datetime.now()
print(time_left)

5 days, 3:30:00
2024-08-12 16:08:15.815979
3 days, 11:21:44.183396
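
A `timedelta` stores its duration as days, seconds, and microseconds. The sketch below reads those parts back and uses fixed datetimes (made up for this example) so that the result is reproducible, unlike a calculation based on `now()`.

```python
import datetime

delta = datetime.timedelta(days=5, hours=3, minutes=30)

print(delta.days)             # whole days
print(delta.seconds)          # leftover seconds within the last day
print(delta.total_seconds())  # entire duration in seconds

# Fixed datetimes give a reproducible difference
start = datetime.datetime(2024, 8, 7, 12, 0, 0)
deadline = datetime.datetime(2024, 8, 11, 0, 0, 0)
remaining = deadline - start
print(remaining)
```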


## (d) Formatting dates and times
- Common codes for formatting a datetime:

> - %Y: Year with century (e.g., 2024)

> - %y: Year without century, zero-padded (e.g., 24)

> - %m: Month as a zero-padded decimal (e.g., 08)

> - %d: Day of the month as a zero-padded decimal (e.g., 07)

> - %H: Hour (00 to 23)

> - %M: Minute (00 to 59)

> - %S: Second (00 to 59)
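
These codes can be mixed freely with literal characters in a format string. A minimal sketch using a fixed datetime (made up for this example) so the results are reproducible:

```python
import datetime

moment = datetime.datetime(2024, 8, 7, 12, 27, 53)

full = moment.strftime("%Y-%m-%d %H:%M:%S")  # date and time
date_only = moment.strftime("%d/%m/%Y")      # day first, slash-separated
time_only = moment.strftime("%H:%M")         # hours and minutes only

print(full)
print(date_only)
print(time_only)
```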


### 1. Format datetime to string

In [49]:
# Format datetime to string

formatted_date1 = datetime.date.today().strftime("%Y-%m-%d")
print(formatted_date1)

formatted_date2 = datetime.date.today().strftime("%y-%m-%d")
print(formatted_date2)

2024-08-07
24-08-07


### 2. Format string to datetime - Parsing

In [51]:
# Parsing a string into a datetime object
date_str = "2024-08-07"
parsed_date = datetime.datetime.strptime(date_str, "%Y-%m-%d")
print(parsed_date)

2024-08-07 00:00:00
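
Parsing also works with time fields, and it fails loudly when the string does not match the format. The sketch below uses made-up strings to show both cases; catching `ValueError` is one possible way to handle bad input.

```python
import datetime

# Parse a string that carries both date and time
stamp = datetime.datetime.strptime("2024-08-07 12:27:53", "%Y-%m-%d %H:%M:%S")

# .date() drops the time part when only the date is needed
only_date = stamp.date()

# A string that does not match the format raises ValueError
try:
    datetime.datetime.strptime("07/08/2024", "%Y-%m-%d")
    parsed_ok = True
except ValueError:
    parsed_ok = False

print(stamp, only_date, parsed_ok)
```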

