# Regression with an Abalone Dataset
Input files - [download here](https://www.kaggle.com/competitions/playground-series-s4e4/data)

Original dataset - [here](https://archive.ics.uci.edu/dataset/1/abalone)
## Development Notes
- Example of a good notebook from a similar regression competition: https://www.kaggle.com/code/oscarm524/ps-s3-ep16-eda-modeling-submission/notebook
## Libraries

In [2]:
### libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import boxcox
from sklearn.preprocessing import StandardScaler

## Load and Preview Data
### Data Description
**Name / Data Type / Measurement Unit / Description**
- Sex / nominal / -- / M, F, and I (infant)
- Length / continuous / mm / longest shell measurement
- Diameter / continuous / mm / perpendicular to length
- Height / continuous / mm / with meat in shell
- Whole weight / continuous / grams / whole abalone
- Shucked weight / continuous / grams / weight of meat
- Viscera weight / continuous / grams / gut weight (after bleeding)
- Shell weight / continuous / grams / after being dried
- Rings / integer / -- / +1.5 gives the age in years
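
The last entry in the list implies a simple conversion from ring count to age. A minimal sketch, assuming a DataFrame with a `Rings` column; the toy values and the `Age` column name are illustrative and not part of the dataset:

```python
import pandas as pd

# Toy frame standing in for the training data (values are illustrative only)
sample = pd.DataFrame({"Rings": [11, 6, 10]})

# Per the data description, age in years is the ring count plus 1.5
sample["Age"] = sample["Rings"] + 1.5

print(sample)
```

The competition target is `Rings` itself, so a derived age column like this would only be useful for interpretation, not as a model input.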

In [3]:
### load data
train_raw = pd.read_csv('Data_Download/train.csv')
test_raw = pd.read_csv('Data_Download/test.csv')

### data info
train_raw.info()
print("\n")
test_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90615 entries, 0 to 90614
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              90615 non-null  int64  
 1   Sex             90615 non-null  object 
 2   Length          90615 non-null  float64
 3   Diameter        90615 non-null  float64
 4   Height          90615 non-null  float64
 5   Whole weight    90615 non-null  float64
 6   Whole weight.1  90615 non-null  float64
 7   Whole weight.2  90615 non-null  float64
 8   Shell weight    90615 non-null  float64
 9   Rings           90615 non-null  int64  
dtypes: float64(7), int64(2), object(1)
memory usage: 6.9+ MB


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60411 entries, 0 to 60410
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              60411 non-null  int64  
 1   Sex             60411 non-null  object 
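
The `info()` output lists `Whole weight.1` and `Whole weight.2`, while the data description names shucked and viscera weight. A hedged sketch of renaming them, assuming the two columns correspond by position to `Shucked weight` and `Viscera weight`; this mapping is an assumption and is not confirmed by the output above. `RENAME_MAP` and `rename_weight_columns` are hypothetical names introduced for illustration:

```python
import pandas as pd

# Assumed mapping (by column position in the data description); verify before use
RENAME_MAP = {
    "Whole weight.1": "Shucked weight",
    "Whole weight.2": "Viscera weight",
}

def rename_weight_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the ambiguous weight columns renamed."""
    return df.rename(columns=RENAME_MAP)

# Tiny stand-in frame with illustrative values
demo = pd.DataFrame({
    "Whole weight": [0.7715],
    "Whole weight.1": [0.3285],
    "Whole weight.2": [0.1465],
    "Shell weight": [0.24],
})
renamed = rename_weight_columns(demo)
print(list(renamed.columns))
```

Because `rename` returns a new frame, the same helper could be applied to both `train_raw` and `test_raw` without modifying the raw copies.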

In [4]:
### preview data
train_raw.head(5)

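|   | id | Sex | Length | Diameter | Height | Whole weight | Whole weight.1 | Whole weight.2 | Shell weight | Rings |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | F | 0.55 | 0.43 | 0.15 | 0.7715 | 0.3285 | 0.1465 | 0.24 | 11 |
| 1 | 1 | F | 0.63 | 0.49 | 0.145 | 1.13 | 0.458 | 0.2765 | 0.32 | 11 |
| 2 | 2 | I | 0.16 | 0.11 | 0.025 | 0.021 | 0.0055 | 0.003 | 0.005 | 6 |
| 3 | 3 | M | 0.595 | 0.475 | 0.15 | 0.9145 | 0.3755 | 0.2055 | 0.25 | 10 |
| 4 | 4 | I | 0.555 | 0.425 | 0.13 | 0.782 | 0.3695 | 0.16 | 0.1975 | 9 |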


In [7]:
### preview data cont.
train_raw.describe(include='all')

Unnamed: 0,id,Sex,Length,Diameter,Height,Whole weight,Whole weight.1,Whole weight.2,Shell weight,Rings
count,90615.0,90615,90615.0,90615.0,90615.0,90615.0,90615.0,90615.0,90615.0,90615.0
unique,,3,,,,,,,,
top,,I,,,,,,,,
freq,,33093,,,,,,,,
mean,45307.0,,0.517098,0.401679,0.135464,0.789035,0.340778,0.169422,0.225898,9.696794
std,26158.441658,,0.118217,0.098026,0.038008,0.457671,0.204428,0.100909,0.130203,3.176221
min,0.0,,0.075,0.055,0.0,0.002,0.001,0.0005,0.0015,1.0
25%,22653.5,,0.445,0.345,0.11,0.419,0.1775,0.0865,0.12,8.0
50%,45307.0,,0.545,0.425,0.14,0.7995,0.33,0.166,0.225,9.0
75%,67960.5,,0.6,0.47,0.16,1.0675,0.463,0.2325,0.305,11.0
