### Day 1: Introduction to Python for Data Science

Welcome to Day 1! In today’s session, we’ll cover some essential concepts and techniques you’ll use in Python for data science.

**Overview of the Topics for Today:**
1. Introduction to Python for Data Science.
2. Data manipulation with Pandas:
   - Data cleaning and transformation.
   - Grouping, merging, and aggregating data.
3. Introduction to NumPy:
   - Array creation and manipulation.
   - Vectorized operations for efficient computation.

Let’s get started!


# Install the libraries using pip if not already installed

In [1]:

!pip install pandas numpy matplotlib
# This command uses `!pip install` to install the libraries needed for
# data manipulation (`pandas`), numerical calculations (`numpy`), and plotting (`matplotlib`)


Defaulting to user installation because normal site-packages is not writeable


### Import Libraries
In this cell, we will import the required libraries. The following libraries are crucial for our project:
- `pandas`: for data manipulation.
- `numpy`: for numerical operations.


In [2]:
import pandas as pd
import numpy as np
#import matplotlib.pyplot as plt


### Data Manipulation with Pandas: Data Cleaning and Transformation

One of the first steps in data science is to clean and transform your data. We will use **Pandas** for that purpose, which provides robust tools for data manipulation.

**Data Cleaning Tasks:**
- Handling missing values.
- Converting data types.
- Removing duplicates.
- Filtering data.
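
The four tasks listed above can be sketched on a small, invented DataFrame. The column names mirror the dataset used later in this notebook, but the values here are hypothetical and exist only to make each step visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented only to illustrate the four tasks
raw = pd.DataFrame({
    "Month": ["Feb", "Feb", "Mar", "Mar", None],
    "ProductRelated": ["1", "1", "10", "19", "2"],  # counts stored as text
    "ExitRates": [0.2, 0.2, 0.05, np.nan, 0.1],
})

median_exit = raw["ExitRates"].median()  # median of the non-missing rates

clean = (
    raw.dropna(subset=["Month"])             # missing values: drop rows with no month
       .fillna({"ExitRates": median_exit})   # missing values: fill a missing rate
       .astype({"ProductRelated": "int64"})  # data types: text to integer
       .drop_duplicates()                    # duplicates: keep the first of identical rows
)

low_exit = clean[clean["ExitRates"] < 0.1]   # filtering: keep rows below a threshold
print(clean)
print(low_exit)
```

Whether to drop or fill a missing value depends on the column; the median fill and the 0.1 threshold are illustrative choices, not recommendations for this dataset.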


### Dataset link
- https://archive.ics.uci.edu/dataset/468/online+shoppers+purchasing+intention+dataset
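
Most cells below operate on a DataFrame named `df` that holds this online shoppers dataset. A minimal sketch of loading it follows, assuming the CSV has been downloaded from the link above. The helper `load_sessions` and the filename `online_shoppers_intention.csv` are assumptions made for illustration. So that the sketch runs without the file, it reads two rows (copied from the session preview later in this notebook) from an in-memory string.

```python
import io
import pandas as pd

def load_sessions(source):
    """Read session data from a file path or a file-like object (hypothetical helper)."""
    return pd.read_csv(source)

# Two rows copied from the preview shown later in this notebook,
# written with the original (pre-rename) column names.
sample_csv = io.StringIO(
    "Administrative,Administrative_Duration,Informational,Informational_Duration,"
    "ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,"
    "SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,"
    "Weekend,Revenue\n"
    "0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,1,1,1,1,Returning_Visitor,False,False\n"
    "0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,Feb,2,2,1,2,Returning_Visitor,False,False\n"
)

df_sample = load_sessions(sample_csv)
print(df_sample.dtypes)

# With the real file on disk (assumed filename), the call would look like:
# df = load_sessions("online_shoppers_intention.csv")
```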

In [36]:
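# Load a local CSV of ODI batting statistics (not the online shoppers dataset linked above).
# Forward slashes in the path work on Windows, macOS, and Linux.
data = pd.read_csv("Batting/ODI data.csv")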
data.head()

Unnamed: 0.1,Unnamed: 0,Player,Span,Mat,Inns,NO,Runs,HS,Ave,BF,SR,100,50,0,Unnamed: 13
0,0,SR Tendulkar (INDIA),1989-2012,463,452,41,18426,200*,44.83,21367,86.23,49,96,20,
1,1,KC Sangakkara (Asia/ICC/SL),2000-2015,404,380,41,14234,169,41.98,18048,78.86,25,93,15,
2,2,RT Ponting (AUS/ICC),1995-2012,375,365,39,13704,164,42.03,17046,80.39,30,82,20,
3,3,ST Jayasuriya (Asia/SL),1989-2011,445,433,18,13430,189,32.36,14725,91.2,28,68,34,
4,4,DPMD Jayawardene (Asia/SL),1998-2015,448,418,39,12650,144,33.37,16020,78.96,19,77,28,


In [38]:
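# Count missing values per column.
# Note: `df` is not defined in the cells shown above; the output below lists the ODI batting columns.
df.isnull().sum()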

Unnamed: 0        0
Player            0
Span              0
Mat               0
Inns              0
NO                0
Runs              0
HS                0
Ave               0
BF                0
SR                0
100               0
50                0
0                 0
Unnamed: 13    2500
dtype: int64
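
In the output above, `Unnamed: 13` is the only column with missing values. If its null count equals the number of rows (the row total is not shown here, so that is an assumption), the column carries no information and can be dropped. A minimal sketch on invented data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a frame with one completely empty column
toy = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "Runs": [120, 85, 40],
    "Unnamed: 13": [np.nan, np.nan, np.nan],
})

null_counts = toy.isnull().sum()                  # missing values per column
trimmed = toy.dropna(axis="columns", how="all")   # drop columns where every value is missing

print(null_counts)
print(trimmed.columns.tolist())
```

Using `how="all"` keeps columns that are only partly empty, which usually deserve a closer look before being removed.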

### List the existing columns

In [4]:
df.columns

Index(['Administrative', 'Administrative_Duration', 'Informational',
       'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration',
       'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay', 'Month',
       'OperatingSystems', 'Browser', 'Region', 'TrafficType', 'VisitorType',
       'Weekend', 'Revenue'],
      dtype='object')

# Dataset Overview

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Administrative           12330 non-null  int64  
 1   Administrative_Duration  12330 non-null  float64
 2   Informational            12330 non-null  int64  
 3   Informational_Duration   12330 non-null  float64
 4   ProductRelated           12330 non-null  int64  
 5   ProductRelated_Duration  12330 non-null  float64
 6   BounceRates              12330 non-null  float64
 7   ExitRates                12330 non-null  float64
 8   PageValues               12330 non-null  float64
 9   SpecialDay               12330 non-null  float64
 10  Month                    12330 non-null  object 
 11  OperatingSystems         12330 non-null  int64  
 12  Browser                  12330 non-null  int64  
 13  Region                   12330 non-null  int64  
 14  TrafficType           

The dataset consists of **12,330 rows** and **18 columns**. Each row is one user session on an e-commerce website, and each column describes one aspect of that session. Here's an overview of the dataset's structure and key features:

---

## **1. Dataset Summary**
- **Total Rows**: 12,330 (each row represents a unique user session).
- **Total Columns**: 18 (features describing session characteristics).
- **No Missing Values**: All columns are completely filled.

---

## **2. Column Details**
| **#** | **Column Name**            | **Data Type** | **Description**                                                                 |
|-------|----------------------------|---------------|---------------------------------------------------------------------------------|
| 0     | `Administrative`           | int64         | Number of administrative pages visited during the session.                      |
| 1     | `Administrative_Duration`  | float64       | Time spent on administrative pages (seconds).                                   |
| 2     | `Informational`            | int64         | Number of informational pages visited during the session.                       |
| 3     | `Informational_Duration`   | float64       | Time spent on informational pages (seconds).                                    |
| 4     | `ProductRelated`           | int64         | Number of product-related pages visited.                                        |
| 5     | `ProductRelated_Duration`  | float64       | Time spent on product-related pages (seconds).                                  |
| 6     | `BounceRates`              | float64       | Bounce rate of the session (proportion of single-page visits).                  |
| 7     | `ExitRates`                | float64       | Exit rate of the session (proportion of exits from the website).                |
| 8     | `PageValues`               | float64       | Perceived value of the page (calculated from e-commerce data).                  |
| 9     | `SpecialDay`               | float64       | Closeness of the session to a special day (e.g., holidays).                     |
| 10    | `Month`                    | object        | Month of the year during the session.                                           |
| 11    | `OperatingSystems`         | int64         | ID of the operating system used during the session.                             |
| 12    | `Browser`                  | int64         | ID of the browser used during the session.                                      |
| 13    | `Region`                   | int64         | Geographic region of the session.                                               |
| 14    | `TrafficType`              | int64         | Source of website traffic (e.g., paid ads, referrals).                          |
| 15    | `VisitorType`              | object        | Type of user: "Returning_Visitor", "New_Visitor", or "Other".                   |
| 16    | `Weekend`                  | bool          | Whether the session took place on a weekend (`True` or `False`).                |
| 17    | `Revenue`                  | bool          | Whether the session resulted in a revenue transaction (`True` or `False`).      |

---

## **3. Data Types**
- **int64 (7 columns)**: Integer numeric values for counts or identifiers.
- **float64 (7 columns)**: Continuous numeric values (durations, rates, or proportions).
- **object (2 columns)**: Categorical text data (e.g., `Month`, `VisitorType`).
- **bool (2 columns)**: Binary values (`True`/`False`).

---

## **4. Memory Usage**
- **Approximate Memory Size**: ~1.5 MB.
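- Each `bool` value takes 1 byte, while each `int64` or `float64` value takes 8 bytes. Small-range ID columns such as `Browser` could be stored in a narrower integer type if memory becomes a concern.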
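
Per-column memory can be checked directly with `memory_usage`. The sketch below builds a hypothetical three-column frame with the same row count as the dataset; the values are placeholders, and only the dtypes matter.

```python
import numpy as np
import pandas as pd

n = 12_330  # row count reported by df.info()

# Placeholder values; one column per dtype of interest
toy = pd.DataFrame({
    "Browser": np.ones(n, dtype="int64"),
    "ExitRates": np.zeros(n, dtype="float64"),
    "Weekend": np.zeros(n, dtype=bool),
})

bytes_per_column = toy.memory_usage(index=False)  # bytes used by each column
compact_browser = toy["Browser"].astype("int8")   # 1 byte per value instead of 8

print(bytes_per_column)
print(compact_browser.memory_usage(index=False))
```

Downcasting to `int8` is only safe when every value fits in the range -128 to 127, which holds for `Browser` here because its reported maximum is 13.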

---

## **5. Insights for Analysis**
- **Behavioral Patterns**: Features like `BounceRates` and `ExitRates` help understand user activity.
- **Engagement Metrics**: Durations (`Administrative_Duration`, `ProductRelated_Duration`) capture user interaction levels.
- **Predictive Features**: Categorical and temporal data like `TrafficType` and `Month` may predict revenue outcomes.

---

This dataset is ready for exploratory data analysis and predictive modeling in e-commerce analytics.


# Descriptive Statistics for Dataset

This section provides an overview of the key descriptive statistics for the dataset. These statistics summarize the distribution, variability, and central tendencies of the features related to user sessions.

In [6]:
df.describe()

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,OperatingSystems,Browser,Region,TrafficType
count,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0,12330.0
mean,2.315166,80.818611,0.503569,34.472398,31.731468,1194.74622,0.022191,0.043073,5.889258,0.061427,2.124006,2.357097,3.147364,4.069586
std,3.321784,176.779107,1.270156,140.749294,44.475503,1913.669288,0.048488,0.048597,18.568437,0.198917,0.911325,1.717277,2.401591,4.025169
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
25%,0.0,0.0,0.0,0.0,7.0,184.1375,0.0,0.014286,0.0,0.0,2.0,2.0,1.0,2.0
50%,1.0,7.5,0.0,0.0,18.0,598.936905,0.003112,0.025156,0.0,0.0,2.0,2.0,3.0,2.0
75%,4.0,93.25625,0.0,0.0,38.0,1464.157214,0.016813,0.05,0.0,0.0,3.0,2.0,4.0,4.0
max,27.0,3398.75,24.0,2549.375,705.0,63973.52223,0.2,0.2,361.763742,1.0,8.0,13.0,9.0,20.0



## **1. Explanation of Statistical Metrics**
| Metric         | Description                                                                                     |
|----------------|-------------------------------------------------------------------------------------------------|
| **Count**      | Number of non-null observations for each column.                                                |
| **Mean**       | Average value for each column. Indicates the central tendency.                                  |
| **Std**        | Standard deviation. Measures the spread or variability of the data from the mean.               |
| **Min**        | Minimum value in the column.                                                                    |
| **25%, 50%, 75%** | Quartiles that split the data into four equal parts. Helps understand data distribution.       |
| **Max**        | Maximum value in the column.                                                                    |
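
To make the table concrete, the sketch below compares `describe()` with the same metrics computed one at a time. The five values are hypothetical page counts. Note that pandas reports the sample standard deviation (`ddof=1`).

```python
import pandas as pd

# Hypothetical page counts for five sessions
pages = pd.Series([0, 0, 1, 4, 27])

summary = pages.describe()

# The same metrics, computed individually
manual = {
    "count": pages.count(),
    "mean": pages.mean(),
    "std": pages.std(ddof=1),   # sample standard deviation, as used by describe()
    "min": pages.min(),
    "25%": pages.quantile(0.25),
    "50%": pages.quantile(0.50),
    "75%": pages.quantile(0.75),
    "max": pages.max(),
}

print(summary)
print(manual)
```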

---

## **2. Statistical Summary**
Below is the detailed statistical summary for each column in the dataset.

| Feature                     | **Mean** | **Std**   | **Min** | **25%** | **50% (Median)** | **75%** | **Max**     | **Description**                                                                 |
|-----------------------------|----------|-----------|---------|---------|------------------|---------|-------------|---------------------------------------------------------------------------------|
| **Administrative**          | 2.32     | 3.32      | 0       | 0       | 1                | 4       | 27          | Number of administrative pages visited per session.                             |
| **Administrative_Duration** | 80.82    | 176.78    | 0       | 0       | 7.5              | 93.26   | 3398.75     | Time spent on administrative pages (seconds).                                   |
| **Informational**           | 0.50     | 1.27      | 0       | 0       | 0                | 0       | 24          | Number of informational pages visited per session.                              |
| **Informational_Duration**  | 34.47    | 140.75    | 0       | 0       | 0                | 0       | 2549.38     | Time spent on informational pages (seconds).                                    |
| **ProductRelated**          | 31.73    | 44.48     | 0       | 7       | 18               | 38      | 705         | Number of product-related pages visited per session.                            |
| **ProductRelated_Duration** | 1194.75  | 1913.67   | 0       | 184.14  | 598.94           | 1464.16 | 63973.52    | Time spent on product-related pages (seconds).                                  |
| **BounceRates**             | 0.022    | 0.048     | 0       | 0       | 0.0031           | 0.0168  | 0.2         | Proportion of single-page visits out of total sessions.                         |
| **ExitRates**               | 0.043    | 0.049     | 0       | 0.0143  | 0.0252           | 0.0500  | 0.2         | Proportion of page exits.                                                       |
| **PageValues**              | 5.89     | 18.57     | 0       | 0       | 0                | 0       | 361.76      | Calculated value of each page during user sessions.                             |
| **SpecialDay**              | 0.061    | 0.199     | 0       | 0       | 0                | 0       | 1           | Proximity to special days (closer to 1 indicates nearer to a holiday or event). |
| **OperatingSystems**        | 2.12     | 0.91      | 1       | 2       | 2                | 3       | 8           | ID of the operating system used (a category code, so mean and std are not meaningful). |
| **Browser**                 | 2.36     | 1.72      | 1       | 2       | 2                | 2       | 13          | ID of the browser used (a category code, so mean and std are not meaningful).   |
| **Region**                  | 3.15     | 2.40      | 1       | 1       | 3                | 4       | 9           | Geographic region ID of the session (a category code).                          |
| **TrafficType**             | 4.07     | 4.03      | 1       | 2       | 2                | 4       | 20          | ID of the traffic source (a category code).                                     |

---

## **3. Observations**
1. **Administrative Behavior**:
   - At least 75% of sessions visit 4 or fewer administrative pages (median = 1); the maximum is 27.
   - Time spent on administrative pages is low for most sessions (median = 7.5 seconds), but the maximum reaches 3,398.75 seconds.

2. **Informational Usage**:
   - At least 75% of sessions visit no informational pages (75th percentile = 0), so most browsing happens on other page types.

3. **ProductRelated Activity**:
   - Product-related pages see the most activity: the mean is 31.73 pages per session and the maximum is 705.
   - Product-related durations vary widely (std ≈ 1,914 seconds), with a maximum of 63,973.52 seconds (about 17.8 hours).

4. **Bounce and Exit Rates**:
   - Both rates are low on average (mean bounce rate ≈ 0.022, mean exit rate ≈ 0.043), and neither exceeds 0.2.

5. **Page Value**:
   - At least 75% of sessions have a `PageValues` of 0 (75th percentile = 0), while the maximum is 361.76.

6. **Special Days**:
   - Most sessions are not close to a special day (`SpecialDay` 75th percentile = 0).

This statistical overview highlights the session dynamics, providing insights into user interaction with the website. 


### Renaming columns for better readability

In [7]:
df = df.rename(columns={
    'Administrative_Duration': 'AdminDuration',
    'Informational_Duration': 'InfoDuration',
    'ProductRelated_Duration': 'ProductDuration'
})


In [8]:
df.columns

Index(['Administrative', 'AdminDuration', 'Informational', 'InfoDuration',
       'ProductRelated', 'ProductDuration', 'BounceRates', 'ExitRates',
       'PageValues', 'SpecialDay', 'Month', 'OperatingSystems', 'Browser',
       'Region', 'TrafficType', 'VisitorType', 'Weekend', 'Revenue'],
      dtype='object')

In [9]:
# Separate numerical and categorical columns
numerical_columns = df.select_dtypes(include=['int64', 'float64']).columns
categorical_columns = df.select_dtypes(include=['object', 'bool']).columns

# Display results
print("Numerical Columns:", numerical_columns.tolist())



Numerical Columns: ['Administrative', 'AdminDuration', 'Informational', 'InfoDuration', 'ProductRelated', 'ProductDuration', 'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay', 'OperatingSystems', 'Browser', 'Region', 'TrafficType']


In [10]:
print("Categorical Columns:", categorical_columns.tolist())

Categorical Columns: ['Month', 'VisitorType', 'Weekend', 'Revenue']


### Target Variable: Revenue
#### Reason:
- The Revenue column is a Boolean column (True/False), likely indicating whether a visitor generated revenue (made a purchase) or not.
- For many e-commerce datasets like this, predicting or analyzing revenue generation is a common goal.

In [11]:
df["Revenue"].value_counts()

Revenue
False    10422
True      1908
Name: count, dtype: int64
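
The counts above show an imbalanced target: 1,908 of 12,330 sessions (about 15.5%) ended in a purchase. The sketch below rebuilds a Boolean Series with exactly those counts and computes the share. The Series is reconstructed for illustration and is not the original column.

```python
import pandas as pd

# Rebuild a Boolean Series with the counts reported by value_counts()
revenue = pd.Series([False] * 10_422 + [True] * 1_908, name="Revenue")

counts = revenue.value_counts()   # absolute counts per class
purchase_rate = revenue.mean()    # mean of a Boolean Series = share of True values

print(counts)
print(f"Share of sessions with revenue: {purchase_rate:.1%}")
```

An imbalance of this size matters later for modeling, since a classifier that always predicts `False` would already be right for most sessions.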

In [12]:
df[df["Revenue"] == False].groupby("Weekend")["ExitRates"].sum()

Weekend
False    395.306431
True      98.469905
Name: ExitRates, dtype: float64

In [15]:
df[df["Revenue"] == False].groupby("Weekend")["ExitRates"]

<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001ACB1929CD0>
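
The output above is only the representation of a `SeriesGroupBy` object: `groupby` records how to split the rows, and nothing is computed until an aggregation is called. A minimal sketch, using the `Weekend` and `ExitRates` values of the first five sessions previewed in this notebook:

```python
import pandas as pd

# Weekend and ExitRates of the first five sessions shown in this notebook
toy = pd.DataFrame({
    "Weekend": [False, False, False, False, True],
    "ExitRates": [0.2, 0.1, 0.2, 0.14, 0.05],
})

grouped = toy.groupby("Weekend")["ExitRates"]  # lazy: no result yet

sizes = grouped.size()   # rows per group
means = grouped.mean()   # mean ExitRates per group

for key, values in grouped:  # a groupby object can also be iterated
    print(key, values.tolist())
print(sizes)
print(means)
```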

In [24]:
df[df["Revenue"] == False].head()  # Filtered data

Unnamed: 0,Administrative,AdminDuration,Informational,InfoDuration,ProductRelated,ProductDuration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,1,1,1,1,Returning_Visitor,False,False
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,Feb,2,2,1,2,Returning_Visitor,False,False
2,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,4,1,9,3,Returning_Visitor,False,False
3,0,0.0,0,0.0,2,2.666667,0.05,0.14,0.0,0.0,Feb,3,2,2,4,Returning_Visitor,False,False
4,0,0.0,0,0.0,10,627.5,0.02,0.05,0.0,0.0,Feb,3,3,1,4,Returning_Visitor,True,False


In [23]:
df[df["Weekend"] == False].head()

Unnamed: 0,Administrative,AdminDuration,Informational,InfoDuration,ProductRelated,ProductDuration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,1,1,1,1,Returning_Visitor,False,False
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,Feb,2,2,1,2,Returning_Visitor,False,False
2,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,Feb,4,1,9,3,Returning_Visitor,False,False
3,0,0.0,0,0.0,2,2.666667,0.05,0.14,0.0,0.0,Feb,3,2,2,4,Returning_Visitor,False,False
5,0,0.0,0,0.0,19,154.216667,0.015789,0.024561,0.0,0.0,Feb,2,2,1,3,Returning_Visitor,False,False


In [32]:
# Filter rows where Weekend is False AND Revenue is False
df[(df["Weekend"] == False) & (df["Revenue"] == False)]

Unnamed: 0,Administrative,AdminDuration,Informational,InfoDuration,ProductRelated,ProductDuration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0,0.00,0,0.0,1,0.000000,0.200000,0.200000,0.0,0.0,Feb,1,1,1,1,Returning_Visitor,False,False
1,0,0.00,0,0.0,2,64.000000,0.000000,0.100000,0.0,0.0,Feb,2,2,1,2,Returning_Visitor,False,False
2,0,0.00,0,0.0,1,0.000000,0.200000,0.200000,0.0,0.0,Feb,4,1,9,3,Returning_Visitor,False,False
3,0,0.00,0,0.0,2,2.666667,0.050000,0.140000,0.0,0.0,Feb,3,2,2,4,Returning_Visitor,False,False
5,0,0.00,0,0.0,19,154.216667,0.015789,0.024561,0.0,0.0,Feb,2,2,1,3,Returning_Visitor,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12321,0,0.00,0,0.0,6,0.000000,0.200000,0.200000,0.0,0.0,Nov,1,8,4,1,Returning_Visitor,False,False
12322,6,76.25,0,0.0,22,1075.250000,0.000000,0.004167,0.0,0.0,Dec,2,2,4,2,Returning_Visitor,False,False
12323,2,64.75,0,0.0,44,1157.976190,0.000000,0.013953,0.0,0.0,Nov,2,2,1,10,Returning_Visitor,False,False
12324,0,0.00,1,0.0,16,503.000000,0.000000,0.037647,0.0,0.0,Nov,2,2,1,1,Returning_Visitor,False,False
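
Because `Weekend` and `Revenue` are already Boolean, the same filter can be written with the `~` (NOT) operator instead of comparing to `False`. The sketch below checks that both spellings select the same rows. The first five rows come from the preview above, and the sixth row is hypothetical, added so that the `Revenue` condition excludes something.

```python
import pandas as pd

toy = pd.DataFrame({
    "Weekend": [False, False, False, False, True, False],
    "Revenue": [False, False, False, False, False, True],  # last row is hypothetical
    "ExitRates": [0.2, 0.1, 0.2, 0.14, 0.05, 0.01],
})

# Explicit comparison, as used in the cell above
explicit = toy[(toy["Weekend"] == False) & (toy["Revenue"] == False)]

# Equivalent form using logical NOT on the Boolean columns
negated = toy[~toy["Weekend"] & ~toy["Revenue"]]

print(negated)
```

Each condition still needs `&` rather than `and`, because the comparison is element-wise across the whole column.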


#### Group the filtered rows by `Weekend` and take the mean of `ExitRates`
Every row that passes the filter above has `Weekend == False`, so the result contains a single group.

In [30]:
df[(df["Weekend"] == False) & (df["Revenue"] == False)].groupby("Weekend")["ExitRates"].mean()

Weekend
False    0.049088
Name: ExitRates, dtype: float64

In [None]:
print(df.groupby("Weekend").groups)  # Group information

In [13]:
df[df["Revenue"]].groupby("Weekend")["ExitRates"].mean()

Weekend
False    0.019938
True     0.018476
Name: ExitRates, dtype: float64

In [None]:
weekend = df.groupby("Weekend")
weekend

In [None]:
# Dictionary to store results
column_variations = {}

# Calculate mean for numerical columns grouped by Weekend when Revenue is False
for col in numerical_columns:
    variation = df[df["Revenue"] == False].groupby("Weekend")[col].mean()
    column_variations[col] = variation

# Display results
for col, var in column_variations.items():
    print(f"Variation in {col} for Revenue == False grouped by Weekend:")
    print(var)
    print("\n")
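
The loop above computes one grouped mean per column. As an alternative sketch, a list of columns can be selected after `groupby`, which returns all means in a single DataFrame. The toy frame uses the first five sessions previewed earlier, and `numeric_cols` is a shortened, illustrative stand-in for `numerical_columns`.

```python
import pandas as pd

# First five sessions from the preview earlier in this notebook
toy = pd.DataFrame({
    "Weekend": [False, False, False, False, True],
    "Revenue": [False, False, False, False, False],
    "BounceRates": [0.2, 0.0, 0.2, 0.05, 0.02],
    "ExitRates": [0.2, 0.1, 0.2, 0.14, 0.05],
})

numeric_cols = ["BounceRates", "ExitRates"]  # stand-in for numerical_columns

# One call: mean of every selected column, grouped by Weekend, for non-revenue sessions
means = toy[~toy["Revenue"]].groupby("Weekend")[numeric_cols].mean()
print(means)
```

The result has one row per `Weekend` value and one column per metric, which is usually easier to compare than a series of separate printouts.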


## Find the average exit rate per month

In [None]:
# Dictionary to store results
column_variations = {}

# Calculate the mean of each numerical column grouped by Month (all sessions)
for col in numerical_columns:
    variation = df.groupby("Month")[col].mean()
    column_variations[col] = variation

# Display results
for col, var in column_variations.items():
    print(f"Mean of {col} grouped by Month:")
    print(var)
    print("\n")

# Average exit rate per month
average_exit_rate = df.groupby('Month')['ExitRates'].mean()
average_exit_rate

# Add a new column for total interaction time

In [None]:
df['TotalInteraction'] = df['AdminDuration'] + df['InfoDuration'] + df['ProductDuration']

In [None]:
df["TotalInteraction"].describe()

### Filter rows where Revenue is True

In [None]:
# Revenue is already Boolean, so it can be used directly as a mask
revenue_data = df[df['Revenue']]

# Preview the filtered rows
revenue_data.head()

In [None]:
# Rows where Revenue is False (~ negates the Boolean mask)
no_revenue_data = df[~df['Revenue']]

# Preview the filtered rows
no_revenue_data.head()

## Grouping and Aggregating Data

With Pandas, we can easily group data based on categories and calculate summaries. Examples:
- Average `BounceRates` per `Month`.
- Total revenue transactions grouped by `VisitorType`.


In [None]:
# Group by Month to calculate mean BounceRates
month_group = df.groupby('Month')['BounceRates'].mean()

# Group by VisitorType for total transactions that generated Revenue
visitor_revenue = df.groupby('VisitorType')['Revenue'].sum()

# Display results
month_group, visitor_revenue


In [None]:
# Grouping by 'Weekend' and aggregating
grouped_data = df.groupby("Weekend").agg({
    "ExitRates": "mean",
    "PageValues": "mean"
}).reset_index()

# Merging the aggregated data back into the original DataFrame
merged_data = pd.merge(df, grouped_data, on="Weekend", suffixes=("", "_GroupedMean"))

# Display the result
merged_data.head()
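
Aggregating and then merging is one way to attach group means to every row. As an alternative sketch, `groupby(...).transform("mean")` returns a result aligned with the original rows, so no merge is needed. The toy values are the first five sessions previewed earlier.

```python
import pandas as pd

# Weekend and ExitRates of the first five sessions from the earlier preview
toy = pd.DataFrame({
    "Weekend": [False, False, False, False, True],
    "ExitRates": [0.2, 0.1, 0.2, 0.14, 0.05],
})

# transform broadcasts each group's mean back to the rows of that group
toy["ExitRates_GroupedMean"] = toy.groupby("Weekend")["ExitRates"].transform("mean")
print(toy)
```

The merge approach is still useful when the aggregated table comes from a different source or is needed on its own.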


## Introduction to NumPy: Array Creation and Operations

We'll use **NumPy** for:
1. Creating arrays from dataset values (e.g., `BounceRates`).
2. Performing vectorized operations to calculate new metrics.


In [None]:
import numpy as np

# Create a NumPy array from the BounceRates column
bounce_rates = df['BounceRates'].to_numpy()

# Perform vectorized operations: scale BounceRates by 100 to express them as percentages
scaled_bounce_rates = bounce_rates * 100

# Select high BounceRates (above 50%) with a Boolean mask.
# Note: df.describe() reports a maximum BounceRates of 0.2 (20%), so this selection is empty for this dataset.
high_bounce = scaled_bounce_rates[scaled_bounce_rates > 50]

scaled_bounce_rates[:10], high_bounce[:10]
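
The sketch below illustrates array creation, reshaping, and vectorized operations on a few values. The `bounce` array holds the `BounceRates` of the first five sessions previewed earlier, and the 10% threshold is a hypothetical cut-off chosen so that the mask selects something.

```python
import numpy as np

# Array creation: from a list, and from a range reshaped into 2 rows x 3 columns
bounce = np.array([0.2, 0.0, 0.2, 0.05, 0.02])
grid = np.arange(6).reshape(2, 3)

# Vectorized arithmetic: one expression applies to every element, no Python loop
percent = bounce * 100

# Boolean mask: keep only the elements above a (hypothetical) 10% threshold
high = percent[percent > 10]

# Element-wise choice between two labels
labels = np.where(percent > 10, "high", "low")

# Reduction along an axis: sum each column of the grid
col_sums = grid.sum(axis=0)

print(percent, high, labels, col_sums)
```

Vectorized expressions like `bounce * 100` run in compiled code over the whole array, which is why they are typically much faster than an equivalent Python `for` loop on large columns.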


### **Step 1: Install Git on Windows**
1. Download Git from [git-scm.com](https://git-scm.com/).
2. Run the installer:
   - During the installation, select the default options unless you have specific preferences.
   - Choose "Git Bash Here" for right-click context menus (optional but helpful).
3. After installation, open **Git Bash** or **Command Prompt** to confirm the installation:
   ```bash
   git --version
   ```
   You should see the installed version of Git.

---

### **Step 2: Set Up Your Git Environment**
1. Configure your username and email:
   ```bash
   git config --global user.name "Your Name"
   git config --global user.email "your.email@example.com"
   ```

2. Verify the configuration:
   ```bash
   git config --list
   ```

---

### **Step 3: Initialize a Local Git Repository**
1. Open Git Bash or Command Prompt.
2. Navigate to your project folder:
   ```bash
   cd /path/to/your/project
   ```
3. Initialize the repository:
   ```bash
   git init
   ```

---

### **Step 4: Stage and Commit Files**
1. Add all the files to the staging area:
   ```bash
   git add .
   ```
2. Commit the files with a message:
   ```bash
   git commit -m "Initial commit"
   ```

---

### **Step 5: Create a Repository on GitHub**
1. Go to [GitHub](https://github.com/) and log in.
2. Click the **"+"** icon in the top-right corner and select **New repository**.
3. Name your repository and click **Create repository**.

---

### **Step 6: Link Your Local Repository to GitHub**
1. Copy the repository's HTTPS URL from GitHub (e.g., `https://github.com/username/repo.git`).
2. Link the local repository to the remote:
   ```bash
   git remote add origin https://github.com/username/repo.git
   ```

3. Push your changes to GitHub:
   ```bash
   git branch -M main
   git push -u origin main
   ```

---

### **Verification**
Go to your GitHub repository page, and you should see your files uploaded. 🎉


## Conclusion

In this notebook, we:
1. Explored the dataset structure and cleaned/renamed columns.
2. Performed data manipulation with Pandas: grouping and aggregating.
3. Used NumPy for numerical operations on the dataset.

Next steps: Explore **visualizations** and integrate **machine learning models**.

---
End of Day 1.
