# Task
Load the datasets, clean them, merge them, and save the result as "cleaned_merged.csv".

## Load data

### Subtask:
Load the `azure_usage.csv` and `external_factors.csv` datasets into pandas DataFrames.


**Reasoning**:
Load the datasets into pandas DataFrames and display the first few rows.



In [None]:
import pandas as pd

# Paths to the source CSVs on the mounted Google Drive
azure_usage_path = '/content/drive/MyDrive/Infosys Springboard Virtual Internship 6.0/azure_usage (1).csv'
external_factors_path = '/content/drive/MyDrive/Infosys Springboard Virtual Internship 6.0/external_factors.csv'

# Read both files into DataFrames
df = pd.read_csv(azure_usage_path)
df1 = pd.read_csv(external_factors_path)

print("azure_usage (1).csv head:")
display(df.head())

print("\nexternal_factors.csv head:")
display(df1.head())

azure_usage (1).csv head:


Unnamed: 0,date,region,resource_type,usage_cpu,usage_storage,users_active
0,2023-01-01,East US,VM,88,1959,470
1,2023-01-01,East US,Storage,92,1595,388
2,2023-01-01,East US,Container,70,621,414
3,2023-01-01,West US,VM,60,1982,287
4,2023-01-01,West US,Storage,85,1371,351



external_factors.csv head:


Unnamed: 0,date,economic_index,cloud_market_demand,holiday
0,2023-01-01,104.97,0.99,1
1,2023-01-02,106.48,1.15,0
2,2023-01-03,97.66,0.98,0
3,2023-01-04,115.79,1.08,0
4,2023-01-05,95.31,1.05,0
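
The `date` column is stored as plain text in the CSV files. As an optional alternative, it can be parsed while reading. The sketch below is only an illustration: it reads a small in-memory sample (rows copied from the head above) rather than the real file, and the names `sample_csv` and `sample_df` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for external_factors.csv
# (rows copied from the head shown above).
sample_csv = io.StringIO(
    "date,economic_index,cloud_market_demand,holiday\n"
    "2023-01-01,104.97,0.99,1\n"
    "2023-01-02,106.48,1.15,0\n"
    "2023-01-03,97.66,0.98,0\n"
)

# parse_dates asks read_csv to convert the column to datetime while loading.
sample_df = pd.read_csv(sample_csv, parse_dates=["date"])

print(sample_df.dtypes)
```

With the real files, the same `parse_dates=["date"]` argument could be passed to the existing `pd.read_csv` calls.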


## Data cleaning

### Subtask:
Clean the `azure_usage` and `external_factors` dataframes. This may involve handling missing values, correcting data types, or removing duplicates.


**Reasoning**:
Check for missing values and display the data types in both dataframes as the first step of cleaning.



In [None]:
print("Missing values in df:")
print(df.isnull().sum())

print("\nMissing values in df1:")
print(df1.isnull().sum())

print("\nInfo for df:")
df.info()

print("\nInfo for df1:")
df1.info()

Missing values in df:
date             0
region           0
resource_type    0
usage_cpu        0
usage_storage    0
users_active     0
dtype: int64

Missing values in df1:
date                   0
economic_index         0
cloud_market_demand    0
holiday                0
dtype: int64

Info for df:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1080 entries, 0 to 1079
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   date           1080 non-null   object
 1   region         1080 non-null   object
 2   resource_type  1080 non-null   object
 3   usage_cpu      1080 non-null   int64 
 4   usage_storage  1080 non-null   int64 
 5   users_active   1080 non-null   int64 
dtypes: int64(3), object(3)
memory usage: 50.8+ KB

Info for df1:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               ------

**Reasoning**:
There are no missing values in either dataframe. The 'date' columns in both dataframes are of type 'object' and should be converted to datetime objects for proper merging and potential time-series analysis. I will convert the 'date' columns to datetime and then check for and remove duplicate rows in both dataframes. Finally, I will display the head and info of the cleaned dataframes.



In [None]:
df['date'] = pd.to_datetime(df['date'])
df1['date'] = pd.to_datetime(df1['date'])

print("\nChecking for duplicates in df:")
print(df.duplicated().sum())
df.drop_duplicates(inplace=True)
print("Duplicates removed from df.")

print("\nChecking for duplicates in df1:")
print(df1.duplicated().sum())
df1.drop_duplicates(inplace=True)
print("Duplicates removed from df1.")

print("\nCleaned df head:")
display(df.head())
print("\nCleaned df info:")
df.info()

print("\nCleaned df1 head:")
display(df1.head())
print("\nCleaned df1 info:")
df1.info()


Checking for duplicates in df:
0
Duplicates removed from df.

Checking for duplicates in df1:
0
Duplicates removed from df1.

Cleaned df head:


Unnamed: 0,date,region,resource_type,usage_cpu,usage_storage,users_active
0,2023-01-01,East US,VM,88,1959,470
1,2023-01-01,East US,Storage,92,1595,388
2,2023-01-01,East US,Container,70,621,414
3,2023-01-01,West US,VM,60,1982,287
4,2023-01-01,West US,Storage,85,1371,351



Cleaned df info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1080 entries, 0 to 1079
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   date           1080 non-null   datetime64[ns]
 1   region         1080 non-null   object        
 2   resource_type  1080 non-null   object        
 3   usage_cpu      1080 non-null   int64         
 4   usage_storage  1080 non-null   int64         
 5   users_active   1080 non-null   int64         
dtypes: datetime64[ns](1), int64(3), object(2)
memory usage: 50.8+ KB

Cleaned df1 head:


Unnamed: 0,date,economic_index,cloud_market_demand,holiday
0,2023-01-01,104.97,0.99,1
1,2023-01-02,106.48,1.15,0
2,2023-01-03,97.66,0.98,0
3,2023-01-04,115.79,1.08,0
4,2023-01-05,95.31,1.05,0



Cleaned df1 info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   date                 90 non-null     datetime64[ns]
 1   economic_index       90 non-null     float64       
 2   cloud_market_demand  90 non-null     float64       
 3   holiday              90 non-null     int64         
dtypes: datetime64[ns](1), float64(2), int64(1)
memory usage: 2.9 KB
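
`duplicated()` with no arguments only flags rows that match in every column. A stricter check, sketched below, looks for repeated keys instead. It assumes that `date`, `region`, and `resource_type` together identify one observation, which the notebook does not state; the toy frame and its third row are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame mirroring some azure_usage columns.
# The third row repeats the key of the first row with a different usage_cpu.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-01", "2023-01-01"]),
    "region": ["East US", "East US", "East US"],
    "resource_type": ["VM", "Storage", "VM"],
    "usage_cpu": [88, 92, 75],
})

# Assumption: these three columns identify a single observation.
key_cols = ["date", "region", "resource_type"]

full_dupes = toy.duplicated().sum()                # rows identical in every column
key_dupes = toy.duplicated(subset=key_cols).sum()  # rows repeating an earlier key

print(full_dupes, key_dupes)
```

In this toy frame the full-row check finds nothing while the key check flags one row, which is the case a plain `duplicated()` would miss.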


## Data merging

### Subtask:
Merge the cleaned dataframes based on a common column (e.g., date).


**Reasoning**:
Merge the two dataframes on the 'date' column using an inner join and display the head and info of the resulting merged dataframe.



In [None]:
merged_df = pd.merge(df, df1, on='date', how='inner')

print("Merged DataFrame head:")
display(merged_df.head())

print("\nMerged DataFrame info:")
merged_df.info()

Merged DataFrame head:


Unnamed: 0,date,region,resource_type,usage_cpu,usage_storage,users_active,economic_index,cloud_market_demand,holiday
0,2023-01-01,East US,VM,88,1959,470,104.97,0.99,1
1,2023-01-01,East US,Storage,92,1595,388,104.97,0.99,1
2,2023-01-01,East US,Container,70,621,414,104.97,0.99,1
3,2023-01-01,West US,VM,60,1982,287,104.97,0.99,1
4,2023-01-01,West US,Storage,85,1371,351,104.97,0.99,1



Merged DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1080 entries, 0 to 1079
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   date                 1080 non-null   datetime64[ns]
 1   region               1080 non-null   object        
 2   resource_type        1080 non-null   object        
 3   usage_cpu            1080 non-null   int64         
 4   usage_storage        1080 non-null   int64         
 5   users_active         1080 non-null   int64         
 6   economic_index       1080 non-null   float64       
 7   cloud_market_demand  1080 non-null   float64       
 8   holiday              1080 non-null   int64         
dtypes: datetime64[ns](1), float64(2), int64(4), object(2)
memory usage: 76.1+ KB
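
The merged frame has the same 1080 rows as `df`, so every usage date found a match in `df1`. An inner join would drop unmatched rows silently, so a possible safeguard is to merge with `validate` and `indicator`. This is a sketch on toy data: the frames `usage` and `factors` are hypothetical, and the 2023-01-02 usage row is made up to force a mismatch.

```python
import pandas as pd

# Hypothetical toy frames; the 2023-01-02 row is made up and has no match in factors.
usage = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-01", "2023-01-02"]),
    "region": ["East US", "West US", "East US"],
    "usage_cpu": [88, 60, 91],
})
factors = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-03"]),
    "holiday": [1, 0],
})

# validate="many_to_one" raises MergeError if a date appears more than once in factors.
# indicator=True adds a _merge column recording where each row came from.
checked = pd.merge(
    usage, factors, on="date", how="left",
    validate="many_to_one", indicator=True,
)

# Rows present in usage but missing from factors.
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched)
```

If `unmatched` is empty on the real data, the left join and the inner join used above return the same rows.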


## Save merged data

### Subtask:
Save the merged dataframe to a new CSV file named "cleaned_merged.csv".


In [None]:
merged_df.to_csv("cleaned_merged.csv", index=False)
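
CSV files do not store dtypes, so a `datetime64[ns]` column is read back as text unless it is parsed again. The round trip below sketches this with an in-memory buffer instead of a file; the frame `frame` is a hypothetical stand-in for `merged_df`, with values copied from the merged head above.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for merged_df.
frame = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-01"]),
    "region": ["East US", "West US"],
    "usage_cpu": [88, 60],
})

# Write to an in-memory buffer so the sketch leaves nothing on disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Plain read: the date column comes back as text.
buffer.seek(0)
raw = pd.read_csv(buffer)

# Read with parse_dates: the date column is restored as datetime.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["date"])

print(raw["date"].dtype, restored["date"].dtype)
```

Because `index=False` is used when writing, no extra index column appears when the data is read back.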

Reload cleaned_merged.csv, convert the date column back to datetime64[ns] (CSV files do not store dtypes, so it is read back as text), and check for duplicate and null values.

In [None]:
import pandas as pd

# Load the cleaned merged dataset
# (this path assumes the file saved above was placed in the Drive folder)
new_df = pd.read_csv('/content/drive/MyDrive/Infosys Springboard Virtual Internship 6.0/cleaned_merged.csv')

# Convert 'date' to datetime type
new_df['date'] = pd.to_datetime(new_df['date'], errors='coerce')

# Check the datatype
print(new_df.dtypes['date'])

# Check for missing values
print("Missing values per column:")
print(new_df.isnull().sum())

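# Count cells that are exactly one space (longer whitespace runs are not caught here)
print("\nChecking for blank strings:")
print((new_df == ' ').sum())

# Count fully duplicated rows
duplicates = new_df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

# Show the dtype of every column
print(new_df.dtypes)

# Count placeholder strings that may stand in for missing values
for col in new_df.columns:
    print(col, new_df[col].isin(['NA', 'NaN', 'None', '']).sum())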

datetime64[ns]
Missing values per column:
date                   0
region                 0
resource_type          0
usage_cpu              0
usage_storage          0
users_active           0
economic_index         0
cloud_market_demand    0
holiday                0
dtype: int64

Checking for blank strings:
date                   0
region                 0
resource_type          0
usage_cpu              0
usage_storage          0
users_active           0
economic_index         0
cloud_market_demand    0
holiday                0
dtype: int64
Number of duplicate rows: 0
date                   datetime64[ns]
region                         object
resource_type                  object
usage_cpu                       int64
usage_storage                   int64
users_active                    int64
economic_index                float64
cloud_market_demand           float64
holiday                         int64
dtype: object
date 0
region 0
resource_type 0
usage_cpu 0
usage_storage 0
users_act
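
The comparison `new_df == ' '` only matches cells that are exactly one space. A broader check, sketched below on made-up data, strips whitespace first so that empty strings and runs of spaces are counted too. The frame `toy` and the list `text_cols` are hypothetical; on the real data the text columns would be `region` and `resource_type`.

```python
import pandas as pd

# Hypothetical toy frame with made-up blank entries in the text columns.
toy = pd.DataFrame({
    "region": ["East US", "   ", ""],
    "resource_type": ["VM", "NA", " Storage "],
    "usage_cpu": [88, 92, 70],
})

# Text columns to inspect (listed by hand here).
text_cols = ["region", "resource_type"]

# Exact single-space comparison, as in the cell above.
exact_space = int((toy["region"] == " ").sum())

# Strip surrounding whitespace, then count cells left empty.
blank_counts = {
    col: int((toy[col].str.strip() == "").sum()) for col in text_cols
}

print(exact_space, blank_counts)
```

Here the exact comparison reports no blanks in `region`, while the stripped comparison counts both the three-space cell and the empty string.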

## Summary:

### Data Analysis Key Findings

*   No missing values were found in either the `azure_usage` or `external_factors` dataframes during the initial check.
*   The 'date' columns in both dataframes were successfully converted from 'object' to 'datetime64[ns]' format.
*   No duplicate rows were found in either the `azure_usage` or `external_factors` dataframes, so none were removed.
*   The two dataframes were merged using an inner join on the 'date' column, producing `merged_df` with 1080 rows and 9 columns and no null values.



~Yash
