# **Week-8 Assignment**
## **Minute Steps Narrow**
### *By Arijit Dhali [Linkedin](https://www.linkedin.com/in/arijit-dhali-b255b0138/)*
---
This dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to submit personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between outputs reflects the use of different Fitbit trackers and individual tracking behaviors/preferences.

---
This Notebook contains:
### `Dataset : Minute Steps Narrow`

# **Importing Libraries**


Before we can work with the data, we need to import the libraries and set the display options for the dataset.

This code snippet imports necessary Python libraries, `sets display options for Pandas`, and prepares the environment for data analysis and visualization.

In [13]:
# Importing required libraries for data analysis and visualization
import pandas as pd                       # Pandas for data manipulation and analysis
import numpy as np                        # NumPy for numerical operations
import matplotlib.pyplot as plt           # Matplotlib for basic plotting
import seaborn as sns                     # Seaborn for statistical data visualization
import plotly.express as px               # Plotly Express for interactive visualizations
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

# Setting display options for Pandas to show two decimal places for floating-point numbers
pd.set_option('display.float_format', lambda x: '%.2f' % x)
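
If the two-decimal format should apply only to selected outputs rather than to the whole session, a scoped setting is one option. The sketch below is not part of the original notebook, and the small `demo` DataFrame is made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only to illustrate the display option
demo = pd.DataFrame({'Steps': [5.33987, 18.12754]})

# Apply the two-decimal format only inside the `with` block;
# the previous setting is restored when the block exits
with pd.option_context('display.float_format', '{:.2f}'.format):
    formatted = demo.to_string()

print(formatted)
```

Outside the `with` block, floats are displayed with whatever format was active before.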

# **Loading Dataset**

After importing the libraries, we load the data using the `GitHub` link to the raw file.

This cell continues the setup by adjusting the `Pandas display options` and then loads the dataset from a `URL` into a `Pandas` DataFrame.

In [14]:
# Display all columns without truncation
pd.set_option('display.max_columns', None)

# Load the Fitbit minute-level steps dataset from URL into 'df' DataFrame
url = 'https://raw.githubusercontent.com/ArijitDhali/PrepInsta-DA-Week-8/main/Raw%20Files/minuteStepsNarrow_merged.csv'
df = pd.read_csv(url, encoding='unicode_escape')

# Display first two rows of the loaded DataFrame
df.head(2)

Unnamed: 0,Id,ActivityMinute,Steps
0,1503960366,4/12/2016 12:00:00 AM,0
1,1503960366,4/12/2016 12:01:00 AM,0


# **Preliminary Data Inspection**

To perform any operation on the dataset, we need to get familiarized with the dataset.

Showcasing the data types of each column in the '`df`' DataFrame.

In [15]:
df.dtypes        # Display data types of columns in the 'df' DataFrame

Id                 int64
ActivityMinute    object
Steps              int64
dtype: object

Detailed information about the '`df`' DataFrame, including *data types* and *memory usage*

In [16]:
df.info(verbose=True)        # Display concise information about 'df' DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1325580 entries, 0 to 1325579
Data columns (total 3 columns):
 #   Column          Non-Null Count    Dtype 
---  ------          --------------    ----- 
 0   Id              1325580 non-null  int64 
 1   ActivityMinute  1325580 non-null  object
 2   Steps           1325580 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 30.3+ MB


Showing the number of rows and columns

In [17]:
df.shape         # Display the shape (rows, columns)

(1325580, 3)

Using `df.describe()` to generate descriptive statistics for the numerical columns in the '`df`' DataFrame.

In [18]:
df.describe()      # Generate descriptive statistics for numerical columns

Unnamed: 0,Id,Steps
count,1325580.0,1325580.0
mean,4847897691.86,5.34
std,2422313222.28,18.13
min,1503960366.0,0.0
25%,2320127002.0,0.0
50%,4445114986.0,0.0
75%,6962181067.0,0.0
max,8877689391.0,220.0
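
The quartiles above show that at least 75% of the recorded minutes have zero steps, so the overall mean says little about active periods. One way to quantify this is to compute the share of zero-step minutes and summarize only the active ones. A minimal sketch, assuming a made-up `sample` frame rather than the real data:

```python
import pandas as pd

# Hypothetical sample: mostly idle minutes with a few active ones
sample = pd.DataFrame({'Steps': [0, 0, 0, 0, 0, 0, 12, 0, 45, 0]})

# Percentage of minutes with no steps recorded
zero_share = (sample['Steps'] == 0).mean() * 100

# Mean step count over minutes that have at least one step
active_mean = sample.loc[sample['Steps'] > 0, 'Steps'].mean()

print(f"{zero_share:.1f}% zero-step minutes; mean of active minutes: {active_mean:.2f}")
```

The same two expressions can be applied to `df['Steps']` in this notebook.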


# **Data Cleaning and Trimming**

We now adjust each column as needed so that later analysis and data visualization run smoothly.

### 1. Standardize Date Column

Here we convert the '`ActivityMinute`' column to datetime format. We then extract the date into a new '`ActivityDay`' column, keep only the time of day in '`ActivityMinute`', and move '`ActivityDay`' to position 1, right after '`Id`'. Finally, we display the first rows of the DataFrame to observe the changes.

In [19]:
# Convert the 'ActivityMinute' column to datetime format
df['ActivityMinute'] = pd.to_datetime(df['ActivityMinute'], format='%m/%d/%Y %I:%M:%S %p')

# Split the timestamp into a date column ('ActivityDay') and a time-of-day column ('ActivityMinute')
df['ActivityDay'] = df['ActivityMinute'].dt.date
df['ActivityMinute'] = df['ActivityMinute'].dt.time
date_column = df.pop('ActivityDay')
df.insert(1, 'ActivityDay', date_column)

# Print the DataFrame to see the changes
df.head(5)

Unnamed: 0,Id,ActivityDay,ActivityMinute,Steps
0,1503960366,2016-04-12,00:00:00,0
1,1503960366,2016-04-12,00:01:00,0
2,1503960366,2016-04-12,00:02:00,0
3,1503960366,2016-04-12,00:03:00,0
4,1503960366,2016-04-12,00:04:00,0
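
Note that `.dt.date` and `.dt.time` return plain Python objects, so both columns end up with `object` dtype and the `.dt` accessor no longer works on them. If later analysis needs time-based grouping, one alternative (not what this notebook does) is to keep `datetime64` columns and derive parts on demand. A sketch under that assumption, using a made-up `raw` frame and a hypothetical `Hour` column:

```python
import pandas as pd

# Hypothetical rows in the same text format as the raw file
raw = pd.DataFrame({
    'Id': [1503960366, 1503960366, 1503960366],
    'ActivityMinute': ['4/12/2016 12:00:00 AM',
                       '4/12/2016 12:01:00 AM',
                       '4/12/2016 1:05:00 PM'],
    'Steps': [0, 0, 7],
})

# Parse the 12-hour timestamps into datetime64 values
raw['ActivityMinute'] = pd.to_datetime(raw['ActivityMinute'],
                                       format='%m/%d/%Y %I:%M:%S %p')

# normalize() sets the time to midnight but keeps the datetime64 dtype
raw['ActivityDay'] = raw['ActivityMinute'].dt.normalize()
raw['Hour'] = raw['ActivityMinute'].dt.hour

# Total steps per user, day, and hour
hourly = raw.groupby(['Id', 'ActivityDay', 'Hour'])['Steps'].sum()
print(hourly)
```

The trade-off is that the displayed day column carries a midnight timestamp internally, in exchange for keeping vectorized datetime operations available.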


### 2. Handling Missing Values and Duplicates

Now, we will calculate and display the percentage of missing values for each column in the '`df`' DataFrame. <br>This information helps in understanding the completeness of the dataset and identifies columns with **missing data**.

In [20]:
# Calculate the percentage of missing values for each column in 'df' DataFrame
row_size = df.shape[0]
for i in df.columns:
    if df[i].isnull().sum() > 0:
        print(i, "----------", (df[i].isnull().sum() / row_size) * 100)
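
The same per-column percentage can also be computed without an explicit loop. The following is a minimal sketch on a made-up `sample` frame (the `NaN` is inserted only for illustration), not the notebook's own method:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value in 'Steps'
sample = pd.DataFrame({
    'Id': [1, 1, 2, 2],
    'Steps': [0, np.nan, 5, 3],
})

# isnull() gives a boolean frame; the column-wise mean is the missing fraction
missing_pct = sample.isnull().mean() * 100

# Keep only columns that actually have missing values
missing_pct = missing_pct[missing_pct > 0]
print(missing_pct)
```

An empty result means that no column has missing values.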

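On this dataset the missing-value check prints nothing, which means no column contains missing values; this matches the non-null counts reported by `df.info()`. Next, we count the duplicate rows in the '`df`' DataFrame to identify potential data duplication issues.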

In [21]:
# Check for duplicate rows in 'df' DataFrame
df.duplicated().sum()

0

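Remove any duplicate rows from the '`df`' DataFrame and confirm that none remain. Since the count above is already 0, this step leaves the data unchanged.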

In [22]:
# Drop duplicate rows from the 'df' DataFrame
df = df.drop_duplicates()
df.duplicated().sum()

0
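
`df.duplicated()` only flags rows that are identical in every column. For minute-level data it may also be worth checking whether the same user has more than one row for the same minute, even when the step counts differ. A minimal sketch on a made-up `sample` frame, assuming `Id` plus the timestamp should form a unique key:

```python
import pandas as pd

# Hypothetical sample: user 1 has two conflicting rows for minute 00:01:00
sample = pd.DataFrame({
    'Id': [1, 1, 1, 2],
    'ActivityMinute': ['00:00:00', '00:01:00', '00:01:00', '00:00:00'],
    'Steps': [0, 4, 9, 0],
})

# Rows identical across all columns
full_dupes = sample.duplicated().sum()

# Rows that repeat the (Id, ActivityMinute) key, regardless of 'Steps'
key_dupes = sample.duplicated(subset=['Id', 'ActivityMinute']).sum()

print(full_dupes, key_dupes)
```

In this notebook the key would be `['Id', 'ActivityDay', 'ActivityMinute']`, since the timestamp has been split into two columns.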

# **Viewing & Saving Clean Data**

Now we sort the DataFrame by user, day, and minute.

In [23]:
# Sort the DataFrame by 'Id', 'ActivityDay', and 'ActivityMinute'
df = df.sort_values(by=['Id', 'ActivityDay', 'ActivityMinute'])

Viewing the final and cleaned data, saving it into `.csv` format

Here, the modified '`df`' DataFrame is saved to a **CSV file** named '`minute_steps_cleaned.csv`' without including the index column.

In [24]:
df.to_csv('minute_steps_cleaned.csv', index=False)   # Save the modified 'df' DataFrame to a CSV file named 'minute_steps_cleaned.csv'
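
CSV stores dates and times as plain text, so reading the cleaned file back returns string columns unless they are parsed again. The sketch below illustrates this with a made-up `cleaned` frame and an in-memory buffer in place of the real file; it is an illustration, not part of the original notebook.

```python
import datetime as dt
import io

import pandas as pd

# Hypothetical frame shaped like the cleaned data
cleaned = pd.DataFrame({
    'Id': [1503960366, 1503960366],
    'ActivityDay': [dt.date(2016, 4, 12), dt.date(2016, 4, 12)],
    'ActivityMinute': [dt.time(0, 0), dt.time(0, 1)],
    'Steps': [0, 0],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores 'ActivityDay' as datetime64; 'ActivityMinute' stays text
restored = pd.read_csv(buffer, parse_dates=['ActivityDay'])
print(restored.dtypes)
```

With the real file, the same call would be `pd.read_csv('minute_steps_cleaned.csv', parse_dates=['ActivityDay'])`.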