# Data Modelling - Determining AWS S3 Storage Tier

Use the synthetic data to build model(s) that predict the storage tier of new files.
Categorizing files into storage tiers efficiently and correctly is useful because it allows for cost optimization, storage efficiency, and client performance optimization (retrieval times).

## AWS S3 Tiers 

The model(s) will analyze each file's metadata to assign it to one of three storage classes.

Overview of the storage classes in use:

<table align="left" style="width:50%"> 
    <tr>
        <th>Class</th>
        <th>Use Case</th>
        <th>Tier</th> 
    </tr>
    <tr>
        <td>S3 Standard</td>
        <td>Frequently Accessed</td>
        <td>"Hot"</td>
    </tr>
    <tr>
        <td>S3 One Zone-IA</td>
        <td>Infrequent, low-availability data </td>
        <td>"Warm"</td>
    </tr>
    <tr>
        <td>S3 Glacier (Deep Archive?)</td>
        <td>Rarely Accessed / Long-term rarely accessed</td>
        <td>"Cold"</td>
    </tr>
</table> 
 

## Step 1) Data Preparation - Preprocessing


Possible approaches to preprocessing the dataset before fitting a model:

**1)** Handle missing/null values 

**2)** Normalize/scale numerical features 

Standardization: Scale values to have a mean of 0 and standard deviation of 1.

Normalization: Rescale values to fall within a range (e.g., 0–1).

**3)** Encode categorical features: ordinal encoding (hot > warm > cold)

**4)** Feature engineering + class imbalance (create/remove columns)

Add new rows for edge cases to help the model learn critical boundaries.

Discuss why `Access_Frequency` and `Frequnecy_of_Access` are both kept rather than dropped.

Determine whether the target variable 'Storage_Tier' is balanced; if it is not, the model risks becoming biased.


#### Handling missing/null values

In [97]:
import numpy as np 
import pandas as pd 

np.random.seed(1)  

df = pd.read_csv("../../data/train-model/train-file-metadata.csv") 

df.head(n=10) 


Unnamed: 0,File_ID,Access_Frequency,Frequnecy_of_Access,File_Size,File_Lifecycle_Stage,Modification_Frequency,File_Age,Storage_Tier
0,File_1,21.2307,24.4205,0.0,0.0,23.3177,2.6015,Hot
1,File_2,24.0589,4.5884,1.8784,8.4383,4.8516,7.7843,Warm
2,File_3,25.1752,10.111,0.0,0.0,5.7295,1.9126,Warm
3,File_4,4.2926,17.4557,0.0,0.0,8.7832,2.3784,Cold
4,File_5,19.9357,23.3908,0.0,0.0,12.2302,3.512,Warm
5,File_6,28.4249,23.1148,7.056,9.1548,9.2075,1.89,Hot
6,File_7,19.6673,8.6535,6.2278,7.922,14.8424,0.0,Warm
7,File_8,16.5561,22.7516,4.0345,3.529,9.6637,5.4443,Warm
8,File_9,9.0017,8.5943,15.8221,15.2384,9.1963,0.0,Warm
9,File_10,17.6529,18.1427,10.5044,4.3349,4.9068,5.525,Warm


In [98]:
print(df.info())
print(df["Storage_Tier"].unique())
print(df.shape) 


df.isna().sum()  # no missing or null values (matches the non-null counts from df.info())
df.describe()  # column ranges differ, so the values should be scaled for a better model fit

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   File_ID                 2000 non-null   object 
 1   Access_Frequency        2000 non-null   float64
 2   Frequnecy_of_Access     2000 non-null   float64
 3   File_Size               2000 non-null   float64
 4   File_Lifecycle_Stage    2000 non-null   float64
 5   Modification_Frequency  2000 non-null   float64
 6   File_Age                2000 non-null   float64
 7   Storage_Tier            2000 non-null   object 
dtypes: float64(6), object(2)
memory usage: 125.1+ KB
None
['Hot' 'Warm' 'Cold']
(2000, 8)


Unnamed: 0,Access_Frequency,Frequnecy_of_Access,File_Size,File_Lifecycle_Stage,Modification_Frequency,File_Age
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,15.515501,12.030583,7.976287,5.460082,10.834073,2.669299
std,8.291801,7.569856,8.20544,5.579764,6.968456,2.702448
min,2.0775,1.0072,0.0,0.0,1.0137,0.0
25%,8.7058,5.65815,0.0,0.0,5.4711,0.0
50%,14.5392,9.67665,6.1356,4.30475,8.90235,2.1473
75%,21.5773,18.0578,13.035925,8.843775,15.534425,4.27335
max,34.5958,29.6737,29.6967,19.7989,29.6585,9.8896


The description of the dataset shows that the columns have different means and spreads, so they need to be standardized to a mean of 0 and a standard deviation of 1. The ranges also differ (for example, File_Age tops out at 9.8896 while Access_Frequency reaches 34.5958), so the values will be rescaled to fall within the range 0–1.

The dataset is complete, so no missing values have to be imputed (for example, with the mean of a column).
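Had the check turned up gaps, mean imputation could look roughly like the sketch below. The toy frame is hypothetical: the column names are borrowed from the dataset, but the values are invented purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (invented values, not taken from the real dataset)
toy = pd.DataFrame({
    "File_Size": [1.0, np.nan, 3.0],
    "File_Age": [2.0, 4.0, np.nan],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()

# Replace each NaN with the mean of its own column
imputed = toy.fillna(toy.mean(numeric_only=True))

print(missing_per_column)
print(imputed)
```

Mean imputation is only one option; the median is often preferred when a column is heavily skewed.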

#### Normalize & Standardize columns

Why is it important to standardize and normalize the values?

Standardizing keeps features with larger values from dominating the model and introducing a bias. It centers each column at 0 with a standard deviation of 1, but it does not change the shape of a distribution, so whether a column is close to normal still has to be checked (I will use visualization tools for this - seaborn?). Normalizing scales all the features to the same range, most commonly 0–1.
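The two rescalings can be compared on a small array. This is a minimal sketch with invented values, using plain NumPy; the formulas are the ones that `StandardScaler` and `MinMaxScaler` apply per column with their default settings.

```python
import numpy as np

# Invented values, for illustration only
values = np.array([2.0, 4.0, 6.0, 8.0])

# Standardization: subtract the mean, divide by the (population) standard deviation
standardized = (values - values.mean()) / values.std()

# Min-max normalization: map the minimum to 0 and the maximum to 1
normalized = (values - values.min()) / (values.max() - values.min())

print(standardized)
print(normalized)
```

Both results keep the same ordering and relative spacing as the input; only the location and scale change.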


In [102]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler 


# Numeric feature columns to scale
stand__norm_columns = df.select_dtypes(include=['float64']).columns

scaler_standard = StandardScaler()
scaler_minmaxScaler = MinMaxScaler()

df_copy = df.copy()

# Standardize (mean 0, standard deviation 1)
standardized_value = scaler_standard.fit_transform(df[stand__norm_columns])
df_copy[stand__norm_columns] = standardized_value

# Normalize (min-max scaling to the range 0-1)
# Note: this is fitted on the original df and written to the same columns,
# so it overwrites the standardized values; the saved file holds min-max values only.
normalized_values = scaler_minmaxScaler.fit_transform(df[stand__norm_columns])
df_copy[stand__norm_columns] = normalized_values

df_copy = df_copy.round(4)  # round to 4 decimal places to keep the CSV compact

np.random.seed(2)

output_file = "preprocessed-train-file-metadata.csv"
df_copy.to_csv(output_file, index=False)



Now use 'preprocessed-train-file-metadata.csv' for further preprocessing.

#### Encode Categorical Features


In [103]:
df = pd.read_csv("preprocessed-train-file-metadata.csv")

print(df["Storage_Tier"].dtype)

# Ordinal mapping: Hot > Warm > Cold
tier_mapping = {"Hot": 2, "Warm": 1, "Cold": 0}

df["Storage_Tier_Encoded"] = df["Storage_Tier"].map(tier_mapping)

# Overwrite the preprocessed file with the encoded column added
output_file = "preprocessed-train-file-metadata.csv"
df.to_csv(output_file, index=False)

print(df.head(n=3))

print(df["Storage_Tier_Encoded"].dtype)


object
  File_ID  Access_Frequency  Frequnecy_of_Access  File_Size  \
0  File_1            0.5890               0.8167     0.0000   
1  File_2            0.6760               0.1249     0.0633   
2  File_3            0.7103               0.3176     0.0000   

   File_Lifecycle_Stage  Modification_Frequency  File_Age Storage_Tier  \
0                0.0000                  0.7786    0.2631          Hot   
1                0.4262                  0.1340    0.7871         Warm   
2                0.0000                  0.1646    0.1934         Warm   

   Storage_Tier_Encoded  
0                     2  
1                     1  
2                     1  
int64


Ordinal encoding is used because the storage tiers have a natural order that matters when training the model:

hot > warm > cold 

2 > 1 > 0 
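One thing worth guarding against with a dictionary mapping is a label that is not in the dictionary: `Series.map` silently turns it into `NaN`. The sketch below, on a small invented series, shows one possible check before casting to integers.

```python
import pandas as pd

tier_mapping = {"Hot": 2, "Warm": 1, "Cold": 0}

# Invented labels, for illustration only
tiers = pd.Series(["Hot", "Warm", "Cold", "Warm"], name="Storage_Tier")
encoded = tiers.map(tier_mapping)

# .map() yields NaN for any label missing from the dictionary, so fail early
unmapped = tiers[encoded.isna()].unique()
assert len(unmapped) == 0, f"Unmapped tier labels: {unmapped}"

encoded = encoded.astype("int64")
print(encoded.tolist())
```

A misspelled label such as "hot" (lowercase) would trip the assertion rather than end up as a missing target value.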

#### Feature Engineering 

The string column 'Storage_Tier' will be dropped, since 'Storage_Tier_Encoded' carries the same information in numeric form.

Try to determine whether each column has a normal distribution, and check for tight edge cases that would improve the model on boundary cases; if these are lacking, more rows will be added manually.
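A quick numeric first pass at the normality question is the sample skewness of each column, before moving on to plots. The sketch below uses two synthetic columns with hypothetical names, not the real features; values near 0 suggest a roughly symmetric distribution, while large positive values suggest a long right tail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic columns with hypothetical names, for illustration only
demo = pd.DataFrame({
    "roughly_normal": rng.normal(loc=10, scale=2, size=2000),
    "right_skewed": rng.exponential(scale=5, size=2000),
})

# Sample skewness per column
skewness = demo.skew()
print(skewness)
```

Skewness alone does not prove normality, so a histogram per column is still a useful follow-up.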

Check whether there is a class imbalance in the storage tier, to avoid a model that overfits to one tier and becomes biased.
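The balance check itself can be as small as a `value_counts` call. The labels below are invented to show an imbalanced case; they do not reflect the actual split in the training file.

```python
import pandas as pd

# Invented labels showing an imbalanced case (not the real class split)
labels = pd.Series(["Warm"] * 6 + ["Hot"] * 3 + ["Cold"] * 1, name="Storage_Tier")

counts = labels.value_counts()                # rows per tier
shares = labels.value_counts(normalize=True)  # fraction of rows per tier

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced
imbalance_ratio = counts.max() / counts.min()

print(shares)
print(imbalance_ratio)
```

If the real ratio turns out to be large, resampling or class weights are possible remedies to evaluate.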

## Step 2) Baseline Model 

### Logistic Regression Model

## Step 3) Model Training and Evaluation

###  K-nearest neighbors (KNN) Model

### Decision Tree Model

### Random Forest Model

### Gradient Boosting (XGBoost) Model

## Step 4) Model Deployment 