# **Week-8 Assignment**
## **Sleep Day**

---
This dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to submit personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between outputs reflects the use of different Fitbit trackers and individual tracking behaviors and preferences.

---
This Notebook contains:
### `Dataset : Sleep Day`

# **Importing Libraries**


To work with the data, we first import the required libraries and set how the dataset is displayed.

The code below imports the necessary Python libraries, sets a `Pandas` display option for floating-point numbers, and prepares the environment for data analysis and visualization.

In [1]:
# Importing required libraries for data analysis and visualization
import pandas as pd                       # Pandas for data manipulation and analysis
import numpy as np                        # NumPy for numerical operations
import matplotlib.pyplot as plt           # Matplotlib for basic plotting
import seaborn as sns                     # Seaborn for statistical data visualization
import plotly.express as px               # Plotly Express for interactive visualizations

# Setting display options for Pandas to show two decimal places for floating-point numbers
pd.set_option('display.float_format', lambda x: '%.2f' % x)
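The display option above only changes how floats are rendered, not the values stored in the DataFrame. A minimal sketch of that behavior, assuming a small made-up frame (`demo` and its values are illustrative and not part of the dataset):

```python
import pandas as pd

# Toy values for illustration only; they do not come from the Sleep Day data
demo = pd.DataFrame({"value": [1.23456, 419.4673]})

# Apply the same two-decimal formatter, but only inside this block
with pd.option_context("display.float_format", lambda x: "%.2f" % x):
    rendered = demo.to_string()

print(rendered)               # floats are shown with two decimals
print(demo.loc[0, "value"])   # the stored value keeps its full precision
```

Using `pd.option_context` here instead of `pd.set_option` keeps the change local to the block, which can be handy when trying out formats.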

# **Loading Dataset**

After importing the libraries, we load the data using the `GitHub` link to the raw file.

The code below continues the setup by adjusting the `Pandas display options` and then loads the dataset from a `URL` into a `Pandas` DataFrame.

In [2]:
# Display all columns without truncation
pd.set_option('display.max_columns', None)

# Load the Sleep Day dataset from URL into 'df' DataFrame
url = 'https://raw.githubusercontent.com/vignay21/Prepinsta-Winter-Internship/main/PrepInsta-Week8/Raw%20CSV/sleepDay_merged.csv'
df = pd.read_csv(url, encoding='unicode_escape')

# Display first two rows of the loaded DataFrame
df.head(2)

Unnamed: 0,Id,SleepDay,TotalSleepRecords,TotalMinutesAsleep,TotalTimeInBed
0,1503960366,4/12/2016 12:00:00 AM,1,327,346
1,1503960366,4/13/2016 12:00:00 AM,2,384,407


# **Preliminary Data Inspection**

Before performing any operation on the dataset, we need to become familiar with it.

The following shows the data types of each column in the '`df`' DataFrame.

In [3]:
df.dtypes        # Display data types of columns in the 'df' DataFrame

Id                     int64
SleepDay              object
TotalSleepRecords      int64
TotalMinutesAsleep     int64
TotalTimeInBed         int64
dtype: object

Detailed information about the '`df`' DataFrame, including *data types* and *memory usage*

In [4]:
df.info(verbose=True)        # Display concise information about 'df' DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 413 entries, 0 to 412
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Id                  413 non-null    int64 
 1   SleepDay            413 non-null    object
 2   TotalSleepRecords   413 non-null    int64 
 3   TotalMinutesAsleep  413 non-null    int64 
 4   TotalTimeInBed      413 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 16.3+ KB


Showing the number of rows and columns

In [5]:
df.shape         # Display the shape (rows, columns)

(413, 5)

Using `df.describe()` to generate descriptive statistics for the numerical columns in the '`df`' DataFrame.

In [6]:
df.describe()      # Generate descriptive statistics for numerical columns

Unnamed: 0,Id,TotalSleepRecords,TotalMinutesAsleep,TotalTimeInBed
count,413.0,413.0,413.0,413.0
mean,5000979403.21,1.12,419.47,458.64
std,2060360173.74,0.35,118.34,127.1
min,1503960366.0,1.0,58.0,61.0
25%,3977333714.0,1.0,361.0,403.0
50%,4702921684.0,1.0,433.0,463.0
75%,6962181067.0,1.0,490.0,526.0
max,8792009665.0,3.0,796.0,961.0


# **Data Cleaning and Trimming**

Each column is cleaned as needed so that the later data visualization runs smoothly.

### 1. Handling Missing Values and Duplicates

Now, we will calculate and display the percentage of missing values for each column in the '`df`' DataFrame. <br>This information helps in understanding the completeness of the dataset and identifies columns with **missing data**.

In [7]:
# Calculate the percentage of missing values for each column in 'df' DataFrame
row_size = df.shape[0]
for i in df.columns:
    if df[i].isnull().sum() > 0:
        print(i, "----------", (df[i].isnull().sum() / row_size) * 100)
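The loop above prints nothing when no column has missing values, which is consistent with the 413 non-null counts reported by `df.info()`. As a minimal sketch, the same percentages can be computed without an explicit loop. The frame `toy` and its `NaN` values below are made up for illustration and are not taken from the dataset:

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing values (illustrative only)
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "TotalMinutesAsleep": [327.0, np.nan, 412.0, np.nan],
    "TotalTimeInBed": [346.0, 407.0, np.nan, 367.0],
})

# isnull() gives booleans; their column-wise mean is the fraction missing
missing_pct = toy.isnull().mean() * 100

# Keep only the columns that actually have missing values
missing_pct = missing_pct[missing_pct > 0]
print(missing_pct)
```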

Next, we check for and print the number of duplicate rows in the '`df`' DataFrame, which helps identify and address potential data duplication issues.

In [8]:
# Check for duplicate rows in the 'df' DataFrame
df.duplicated().sum()

3

Remove the duplicate rows from the '`df`' DataFrame

In [9]:
# Drop duplicate rows from the 'df' DataFrame
df = df.drop_duplicates()
df.duplicated().sum()

0
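Before dropping duplicates, it can help to look at the rows involved. The sketch below assumes a small made-up frame (`toy` is illustrative, not the real data) and uses `keep=False` so that every copy of a duplicated row is shown, not only the later repeats:

```python
import pandas as pd

# Toy frame in which the first two rows are exact duplicates (illustrative only)
toy = pd.DataFrame({
    "Id": [1, 1, 2, 2],
    "SleepDay": ["2016-04-12", "2016-04-12", "2016-04-12", "2016-04-13"],
    "TotalMinutesAsleep": [327, 327, 384, 412],
})

# keep=False marks all copies of a duplicated row, so both are visible
dup_rows = toy[toy.duplicated(keep=False)]
print(dup_rows)

# The default (keep='first') counts only the repeats that would be removed
n_dupes = toy.duplicated().sum()

# Drop the repeats and rebuild a clean 0..n-1 index
deduped = toy.drop_duplicates().reset_index(drop=True)
```

Note that `drop_duplicates()` on its own keeps the original index labels; `reset_index(drop=True)` is an optional extra step in this sketch.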

### 2. Remove Unnecessary Columns

Here we would specify the names of any columns to drop and then drop them from the DataFrame. No columns are dropped for this dataset, so the code below is left commented out.

In [10]:
# Column names to drop
#columns_to_drop = []

# Drop the specified columns
#df = df.drop(columns=columns_to_drop)

In [11]:
#df.sample()

### 3. Standardize Date Column

Here we convert the '`SleepDay`' column to datetime format and then keep only the date part, formatted as `YYYY-MM-DD`, overwriting the column in place. Finally, we display a random sample of rows to observe the changes.

In [12]:
# Convert the 'SleepDay' column to datetime format
df['SleepDay'] = pd.to_datetime(df['SleepDay'], format='%m/%d/%Y %I:%M:%S %p')

# Keep only the date part of 'SleepDay', stored as a 'YYYY-MM-DD' string
df['SleepDay'] = df['SleepDay'].dt.strftime('%Y-%m-%d')

# Display five random rows to see the changes
df.sample(5)

Unnamed: 0,Id,SleepDay,TotalSleepRecords,TotalMinutesAsleep,TotalTimeInBed
230,5553957443,2016-04-14,1,357,418
269,5577150313,2016-04-22,1,338,366
84,3977333714,2016-04-15,1,424,556
199,4558609924,2016-05-08,1,123,134
18,1503960366,2016-05-05,1,247,264
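Note that `dt.strftime` returns text, so after this step '`SleepDay`' holds strings, not datetimes. If date operations are needed later, one possible alternative is `dt.normalize()`, which keeps the datetime dtype. The sketch below assumes two sample strings in the same format as the raw file (`toy` is illustrative):

```python
import pandas as pd

# Two sample timestamps in the same format as the raw 'SleepDay' column
toy = pd.DataFrame({"SleepDay": ["4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM"]})

parsed = pd.to_datetime(toy["SleepDay"], format="%m/%d/%Y %I:%M:%S %p")

# Option used in this notebook: format as text, easy to read and to save
as_text = parsed.dt.strftime("%Y-%m-%d")

# Alternative: keep datetime values with the time set to midnight
as_datetime = parsed.dt.normalize()

print(as_text.tolist())
print(as_datetime.dt.day_name().tolist())   # date accessors still work
```

Which form is preferable depends on what follows: strings are convenient for export, while datetimes allow filtering and grouping by weekday or month.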


# **Viewing & Saving Clean Data**

Here we melt the original DataFrame to transform it from wide to long format, sort the melted DataFrame by '`Id`' and '`SleepDay`', and pivot it back to wide format. Finally, we print the pivoted DataFrame.

In [13]:
# Melt the original DataFrame to convert it from wide to long format
melted_df = df.melt(id_vars=['Id', 'SleepDay'], value_vars=['TotalSleepRecords', 'TotalMinutesAsleep', 'TotalTimeInBed'], var_name='Variable', value_name='Value')

# Sort the melted DataFrame by 'Id' and 'SleepDay'
sorted_df = melted_df.sort_values(by=['Id', 'SleepDay'])

# Pivot the sorted melted DataFrame
pivoted_df = sorted_df.pivot_table(index=['Id', 'SleepDay'], columns='Variable', values='Value', aggfunc='first').reset_index()

# Print the pivoted DataFrame
print(pivoted_df)

Variable          Id    SleepDay  TotalMinutesAsleep  TotalSleepRecords  \
0         1503960366  2016-04-12                 327                  1   
1         1503960366  2016-04-13                 384                  2   
2         1503960366  2016-04-15                 412                  1   
3         1503960366  2016-04-16                 340                  2   
4         1503960366  2016-04-17                 700                  1   
..               ...         ...                 ...                ...   
405       8792009665  2016-04-30                 343                  1   
406       8792009665  2016-05-01                 503                  1   
407       8792009665  2016-05-02                 415                  1   
408       8792009665  2016-05-03                 516                  1   
409       8792009665  2016-05-04                 439                  1   

Variable  TotalTimeInBed  
0                    346  
1                    407  
2                 
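To see what the melt, sort, and pivot steps do to the data, here is a minimal sketch on a three-row toy frame (`toy` and its values are made up for illustration). Under that assumption, the round trip returns the same values, with rows ordered by '`Id`' and '`SleepDay`' and the value columns in alphabetical order:

```python
import pandas as pd

# Toy frame with the same column names as the dataset (values are illustrative)
toy = pd.DataFrame({
    "Id": [2, 1, 1],
    "SleepDay": ["2016-04-12", "2016-04-13", "2016-04-12"],
    "TotalSleepRecords": [1, 2, 1],
    "TotalMinutesAsleep": [412, 384, 327],
    "TotalTimeInBed": [442, 407, 346],
})
value_cols = ["TotalSleepRecords", "TotalMinutesAsleep", "TotalTimeInBed"]

# Wide to long: one row per (Id, SleepDay, Variable)
long_df = toy.melt(id_vars=["Id", "SleepDay"], value_vars=value_cols,
                   var_name="Variable", value_name="Value")

# Long back to wide; 'first' keeps one value per (Id, SleepDay, Variable)
wide_df = long_df.pivot_table(index=["Id", "SleepDay"], columns="Variable",
                              values="Value", aggfunc="first").reset_index()
wide_df.columns.name = None   # drop the leftover 'Variable' axis label

# Compare with simply sorting the original rows
direct = toy.sort_values(["Id", "SleepDay"]).reset_index(drop=True)
same_values = (wide_df[list(direct.columns)].to_numpy().tolist()
               == direct.to_numpy().tolist())
print(list(wide_df.columns), same_values)
```

One thing to keep in mind is that `aggfunc='first'` would silently keep a single row if two rows shared the same '`Id`' and '`SleepDay`'.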

Viewing the final and cleaned data, saving it into `.csv` format

Here, the modified '`pivoted_df`' DataFrame is saved to a **CSV file** named '`sleep_day_data_cleaned.csv`' without including the index column.

In [14]:
pivoted_df.to_csv('sleep_day_data_cleaned.csv', index=False)   # Save the 'pivoted_df' DataFrame to a CSV file named 'sleep_day_data_cleaned.csv'
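A quick way to check an export is to read it back and compare. The sketch below uses an in-memory buffer in place of a real file and a small made-up frame (`toy` and `buffer` are illustrative names), so it runs without touching the disk:

```python
import io
import pandas as pd

# Toy frame standing in for the cleaned data (illustrative values)
toy = pd.DataFrame({
    "Id": [1503960366, 1503960366],
    "SleepDay": ["2016-04-12", "2016-04-13"],
    "TotalMinutesAsleep": [327, 384],
})

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)   # index=False avoids an extra index column

# Rewind and read the CSV text back into a DataFrame
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.shape)
print(list(reloaded.columns))
```

With `index=False`, the reloaded frame has no extra `Unnamed: 0` column, and '`SleepDay`' comes back as plain text unless `parse_dates` is passed to `read_csv`.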