# **Week-8 Assignment**
## **Weight Log Info**
### *By Arijit Dhali [Linkedin](https://www.linkedin.com/in/arijit-dhali-b255b0138/)*
---
This dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to submit personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between outputs reflects the use of different Fitbit trackers and individual tracking behaviors and preferences.

---
This Notebook contains:
### `Dataset : Weight Log Info`

# **Importing Libraries**


Before we can work with the data, we need to import the libraries and set the display options for the dataset.

This code snippet imports necessary Python libraries, `sets display options for Pandas`, and prepares the environment for data analysis and visualization.

In [1]:
# Importing required libraries for data analysis and visualization
import pandas as pd                       # Pandas for data manipulation and analysis
import numpy as np                        # NumPy for numerical operations
import matplotlib.pyplot as plt           # Matplotlib for basic plotting
import seaborn as sns                     # Seaborn for statistical data visualization
import plotly.express as px               # Plotly Express for interactive visualizations

# Setting display options for Pandas to show two decimal places for floating-point numbers
pd.set_option('display.float_format', lambda x: '%.2f' % x)
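
As a quick check of what this option does, the sketch below suggests that `display.float_format` changes only how floats are printed, not the values stored. It uses an illustrative value (not taken from the dataset) and `pd.option_context`, which applies the setting only inside the `with` block.

```python
import pandas as pd

# Illustrative value, not taken from the dataset
weights = pd.Series([115.963147])

# Apply the two-decimal format only inside this block
with pd.option_context('display.float_format', lambda x: '%.2f' % x):
    shown = repr(weights)

print(shown)            # the float is rendered with two decimals
print(weights.iloc[0])  # the stored value keeps its full precision
```

Any rounding seen in the outputs of this notebook is therefore a display effect; calculations still use the unrounded values.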

# **Loading Dataset**

After importing the libraries, we load the data using the `GitHub` link to the raw file.

This cell continues the setup by adjusting `Pandas display options`, then loads the dataset from a `URL` into a `Pandas` DataFrame.

In [2]:
# Display all columns without truncation
pd.set_option('display.max_columns', None)

# Load the weight log dataset from URL into 'df' DataFrame
url = 'https://raw.githubusercontent.com/ArijitDhali/PrepInsta-DA-Week-8/main/Raw%20Files/weightLogInfo_merged.csv'
df = pd.read_csv(url, encoding='unicode_escape')

# Display first two rows of the loaded DataFrame
df.head(2)

Unnamed: 0,Id,Date,WeightKg,WeightPounds,Fat,BMI,IsManualReport,LogId
0,1503960366,5/2/2016 11:59:59 PM,52.6,115.96,22.0,22.65,True,1462233599000
1,1503960366,5/3/2016 11:59:59 PM,52.6,115.96,,22.65,True,1462319999000
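
For readers without network access, the two rows shown above can be rebuilt from an inline string and read with the same `pd.read_csv` call. This is only a sketch: `sample_csv` and `sample_df` are illustrative names, and the values are copied as displayed (so already rounded to two decimals).

```python
import io
import pandas as pd

# The two rows from the output above, as displayed (rounded values)
sample_csv = """Id,Date,WeightKg,WeightPounds,Fat,BMI,IsManualReport,LogId
1503960366,5/2/2016 11:59:59 PM,52.6,115.96,22.0,22.65,True,1462233599000
1503960366,5/3/2016 11:59:59 PM,52.6,115.96,,22.65,True,1462319999000
"""

# read_csv accepts any file-like object, so StringIO stands in for the URL
sample_df = pd.read_csv(io.StringIO(sample_csv))
print(sample_df.shape)
print(sample_df['Fat'].isna().sum())  # the empty 'Fat' field is read as NaN
```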


# **Preliminary Data Inspection**

To perform any operation on the dataset, we need to get familiarized with the dataset.

Showcasing the data types of each column in the '`df`' DataFrame.

In [3]:
df.dtypes        # Display data types of columns in the 'df' DataFrame

Id                  int64
Date               object
WeightKg          float64
WeightPounds      float64
Fat               float64
BMI               float64
IsManualReport       bool
LogId               int64
dtype: object

Detailed information about the '`df`' DataFrame, including *data types* and *memory usage*

In [4]:
df.info(verbose=True)        # Display concise information about 'df' DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67 entries, 0 to 66
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Id              67 non-null     int64  
 1   Date            67 non-null     object 
 2   WeightKg        67 non-null     float64
 3   WeightPounds    67 non-null     float64
 4   Fat             2 non-null      float64
 5   BMI             67 non-null     float64
 6   IsManualReport  67 non-null     bool   
 7   LogId           67 non-null     int64  
dtypes: bool(1), float64(4), int64(2), object(1)
memory usage: 3.9+ KB


Showing the number of rows and columns

In [5]:
df.shape         # Display the shape (rows, columns)

(67, 8)

Using `df.describe()` to generate descriptive statistics for the numerical columns in the '`df`' DataFrame.

In [6]:
df.describe()      # Generate descriptive statistics for numerical columns

Unnamed: 0,Id,WeightKg,WeightPounds,Fat,BMI,LogId
count,67.0,67.0,67.0,2.0,67.0,67.0
mean,7009282134.66,72.04,158.81,23.5,25.19,1461771594283.58
std,1950321943.92,13.92,30.7,2.12,3.07,782994783.61
min,1503960366.0,52.6,115.96,22.0,21.45,1460443631000.0
25%,6962181067.0,61.4,135.36,22.75,23.96,1461079185000.0
50%,6962181067.0,62.5,137.79,23.5,24.39,1461801599000.0
75%,8877689391.0,85.05,187.5,24.25,25.56,1462375450500.0
max,8877689391.0,133.5,294.32,25.0,47.54,1463097599000.0
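
`Id` and `LogId` are identifiers, so their mean and standard deviation carry no meaning. One possible refinement, sketched here on four rows taken from outputs printed elsewhere in this notebook, is to describe only the measurement columns. The names `sample` and `measure_cols` are illustrative.

```python
import pandas as pd

# Four rows copied from this notebook's printed outputs (rounded as displayed)
sample = pd.DataFrame({
    'Id': [1503960366, 1927972279, 2873212765, 2873212765],
    'WeightKg': [52.6, 133.5, 56.7, 57.3],
    'BMI': [22.65, 47.54, 21.45, 21.69],
})

# Restrict the summary to columns that hold actual measurements
measure_cols = ['WeightKg', 'BMI']
summary = sample[measure_cols].describe()
print(summary)
```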


# **Data Cleaning and Trimming**

Modifying each column as needed so that the later analysis and data visualization run smoothly.

### 1. Handling the large missing values

Now, we will calculate and display the percentage of missing values for each column in the '`df`' DataFrame. <br>This information helps in understanding the completeness of the dataset and identifies columns with **missing data**.

In [7]:
# Calculate the percentage of missing values for each column in 'df' DataFrame
row_size = df.shape[0]
for i in df.columns:
    if df[i].isnull().sum() > 0:
        print(i, "----------", (df[i].isnull().sum() / row_size) * 100)

Fat ---------- 97.01492537313433
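
The same percentages can also be computed without an explicit loop. The sketch below uses a stand-in frame (`demo`, an illustrative name) with 67 rows and only 2 non-null `Fat` values, matching the counts reported by `df.info()`; the `WeightKg` values are placeholders.

```python
import numpy as np
import pandas as pd

# Stand-in frame: 67 rows with only 2 non-null 'Fat' values, as in df.info()
demo = pd.DataFrame({
    'WeightKg': [60.0] * 67,              # placeholder values
    'Fat': [22.0, 25.0] + [np.nan] * 65,
})

# The mean of a boolean mask is the share of True values per column
missing_pct = demo.isnull().mean() * 100
missing_pct = missing_pct[missing_pct > 0]  # keep only columns with nulls
print(missing_pct)
```

On the real data this should report about 97.01 for `Fat`, the same figure as the loop above.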


Next, we count the duplicate rows in the '`df`' DataFrame to identify and address potential data duplication issues.

In [8]:
# Check for duplicate rows in 'df' DataFrame
df.duplicated().sum()

0

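Remove any duplicate rows from the '`df`' DataFrame and confirm that none remain. Since the count above is already 0, this step leaves the data unchanged.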

In [9]:
# Drop duplicate rows from the 'df' DataFrame
df = df.drop_duplicates()
df.duplicated().sum()

0
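
`df.duplicated()` flags only rows that match in every column. As an optional extra check (not part of the original workflow), the sketch below looks for repeated logs of the same user at the same timestamp by passing `subset`. The third row of `demo` is hypothetical and exists only to show the difference.

```python
import pandas as pd

# Hypothetical rows: the last two share Id and Date but differ in WeightKg
demo = pd.DataFrame({
    'Id': [1503960366, 1503960366, 1503960366],
    'Date': ['5/2/2016 11:59:59 PM', '5/3/2016 11:59:59 PM', '5/3/2016 11:59:59 PM'],
    'WeightKg': [52.6, 52.6, 52.9],
})

full_row_dupes = demo.duplicated().sum()                    # all columns must match
key_dupes = demo.duplicated(subset=['Id', 'Date']).sum()    # only the key columns
print(full_row_dupes, key_dupes)
```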

### 2. Remove Unnecessary Columns

Here we specify the column names to drop: '`Fat`' is about 97% missing, and '`IsManualReport`' and '`LogId`' are not needed for the analysis. We then drop these columns from the DataFrame.

In [10]:
# Column names to drop
columns_to_drop = ['IsManualReport', 'Fat', 'LogId']

# Drop the specified columns
df = df.drop(columns=columns_to_drop)

In [11]:
df.sample()

Unnamed: 0,Id,Date,WeightKg,WeightPounds,BMI
18,6962181067,4/17/2016 11:59:59 PM,61.4,135.36,23.96
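
Running the drop cell a second time raises a `KeyError`, because the columns are already gone. A minimal sketch of one way around this, assuming it is acceptable to skip missing labels silently, is the `errors='ignore'` argument of `DataFrame.drop`. The frame `demo` is a one-row illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    'Id': [1503960366],
    'WeightKg': [52.6],
    'Fat': [22.0],
    'LogId': [1462233599000],
})

columns_to_drop = ['IsManualReport', 'Fat', 'LogId']  # 'IsManualReport' is absent here

# errors='ignore' skips labels that are not present instead of raising KeyError
trimmed = demo.drop(columns=columns_to_drop, errors='ignore')
trimmed_again = trimmed.drop(columns=columns_to_drop, errors='ignore')  # safe to repeat
print(list(trimmed_again.columns))
```

The trade-off is that a misspelled column name would also be skipped without warning.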


### 3. Standardize Date Column

Here we convert the '`Date`' column to datetime format. Next, we extract only the date from '`Date`' into a new column, '`Date Weight`', named after the subject of this DataFrame. We then drop the original '`Date`' column and move the new column to its original position. Finally, we display a random sample of the DataFrame to observe the changes.

In [12]:
# Convert the 'Date' column to datetime format
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y %I:%M:%S %p')

# Keep only the date part, stored as 'YYYY-MM-DD' strings
df['Date Weight'] = df['Date'].dt.strftime('%Y-%m-%d')

# Drop the original 'Date' column
df = df.drop(columns=['Date'])

# Move the 'Date Weight' column to index 1 (the second column)
date_weight_column = df.pop('Date Weight')
df.insert(1, 'Date Weight', date_weight_column)

# Display five random rows to see the changes
df.sample(5)

Unnamed: 0,Id,Date Weight,WeightKg,WeightPounds,BMI
40,6962181067,2016-05-10,62.1,136.91,24.24
33,6962181067,2016-05-03,61.0,134.48,23.82
22,6962181067,2016-04-21,61.4,135.36,23.96
0,1503960366,2016-05-02,52.6,115.96,22.65
44,8877689391,2016-04-13,84.9,187.17,25.41
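
Note that `dt.strftime` returns text, so '`Date Weight`' is no longer a datetime column after this step. If date arithmetic or resampling is needed later, one alternative (a sketch, not what the notebook does) is `dt.normalize()`, which keeps the datetime dtype and sets the time to midnight. The timestamps below are copied from outputs shown earlier.

```python
import pandas as pd

# Timestamps copied from outputs shown earlier in this notebook
raw = pd.Series(['5/2/2016 11:59:59 PM', '5/3/2016 11:59:59 PM', '4/17/2016 11:59:59 PM'])

parsed = pd.to_datetime(raw, format='%m/%d/%Y %I:%M:%S %p')
as_text = parsed.dt.strftime('%Y-%m-%d')  # strings, as in the cell above
as_dates = parsed.dt.normalize()          # datetime values with the time set to midnight

print(as_text.tolist())
print(as_dates.dtype)
```

Strings in `YYYY-MM-DD` form still sort chronologically, which is why the sorting in the next section works on the text column.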


# **Viewing & Saving Clean Data**

Here we melt the DataFrame to transform it from wide to long format, sort the melted DataFrame by '`Id`' and '`Date Weight`', and pivot it back to wide format. The pivot orders the value columns alphabetically ('`BMI`', '`WeightKg`', '`WeightPounds`'). Finally, we print the pivoted DataFrame.

In [13]:
# Melt the original DataFrame to convert it from wide to long format
melted_df = df.melt(id_vars=['Id', 'Date Weight'], value_vars=['WeightKg', 'WeightPounds', 'BMI'], var_name='Variable', value_name='Value')

# Sort the melted DataFrame by 'Id' and 'Date Weight'
sorted_df = melted_df.sort_values(by=['Id', 'Date Weight'])

# Pivot the sorted melted DataFrame
pivoted_df = sorted_df.pivot_table(index=['Id', 'Date Weight'], columns='Variable', values='Value', aggfunc='first').reset_index()

# Print the pivoted DataFrame
print(pivoted_df)

Variable          Id Date Weight   BMI  WeightKg  WeightPounds
0         1503960366  2016-05-02 22.65     52.60        115.96
1         1503960366  2016-05-03 22.65     52.60        115.96
2         1927972279  2016-04-13 47.54    133.50        294.32
3         2873212765  2016-04-21 21.45     56.70        125.00
4         2873212765  2016-05-12 21.69     57.30        126.32
..               ...         ...   ...       ...           ...
62        8877689391  2016-05-06 25.44     85.00        187.39
63        8877689391  2016-05-08 25.56     85.40        188.27
64        8877689391  2016-05-09 25.61     85.50        188.50
65        8877689391  2016-05-11 25.56     85.40        188.27
66        8877689391  2016-05-12 25.14     84.00        185.19

[67 rows x 5 columns]
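
When every ('`Id`', '`Date Weight`') pair is unique, the melt, sort, and pivot sequence appears to give the same result as sorting the rows and reordering the columns directly. The sketch below compares both routes on three rows taken from the outputs above; `demo`, `pivoted_sample`, and `direct_sample` are illustrative names, and the uniqueness of the pairs is an assumption that should be checked on the full data.

```python
import pandas as pd

# Three rows from the outputs above, deliberately out of order
demo = pd.DataFrame({
    'Id': [8877689391, 1503960366, 1503960366],
    'Date Weight': ['2016-05-06', '2016-05-03', '2016-05-02'],
    'WeightKg': [85.0, 52.6, 52.6],
    'WeightPounds': [187.39, 115.96, 115.96],
    'BMI': [25.44, 22.65, 22.65],
})

# Route 1: melt, sort, pivot (as in the cell above)
melted = demo.melt(id_vars=['Id', 'Date Weight'],
                   value_vars=['WeightKg', 'WeightPounds', 'BMI'],
                   var_name='Variable', value_name='Value')
pivoted_sample = (melted.sort_values(by=['Id', 'Date Weight'])
                  .pivot_table(index=['Id', 'Date Weight'], columns='Variable',
                               values='Value', aggfunc='first')
                  .reset_index()
                  .rename_axis(None, axis=1))  # drop the 'Variable' label on the columns

# Route 2: sort rows and reorder columns directly
direct_sample = (demo.sort_values(by=['Id', 'Date Weight'])
                 [['Id', 'Date Weight', 'BMI', 'WeightKg', 'WeightPounds']]
                 .reset_index(drop=True))

print(pivoted_sample.equals(direct_sample))
```

If a pair did repeat, `aggfunc='first'` would keep only the first value per pair, so the pivoted result would have fewer rows than the sorted one.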


Viewing the final and cleaned data, saving it into `.csv` format

Here, the modified '`pivoted_df`' DataFrame is saved to a **CSV file** named '`weight_log_info_data_cleaned.csv`' without including the index column.

In [14]:
pivoted_df.to_csv('weight_log_info_data_cleaned.csv', index=False)   # Save the 'pivoted_df' DataFrame to a CSV file named 'weight_log_info_data_cleaned.csv'
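
To confirm that a file saved this way reads back as expected, a round trip can be sketched as follows. It writes a two-row stand-in frame (`demo`, with values from the output above) to a temporary directory, so the real output file is not touched.

```python
import os
import tempfile
import pandas as pd

# Two rows from the pivoted output above
demo = pd.DataFrame({
    'Id': [1503960366, 1927972279],
    'Date Weight': ['2016-05-02', '2016-04-13'],
    'BMI': [22.65, 47.54],
    'WeightKg': [52.6, 133.5],
    'WeightPounds': [115.96, 294.32],
})

# Write to a temporary folder and read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'weight_log_info_data_cleaned.csv')
    demo.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.shape)
print(reloaded.dtypes)
```

On reload, '`Date Weight`' comes back as text; passing `parse_dates=['Date Weight']` to `pd.read_csv` would parse it as dates instead. The `display.float_format` option set earlier does not affect the file, since `to_csv` writes full-precision values unless its own `float_format` argument is given.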