Abalone Data Set :
Abstract: Predict the age of abalone from physical measurements.

Data Set Information:

Predicting the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability) may be required to solve the problem. 

From the original data examples with missing values were removed (the majority having the predicted value missing), and the ranges of the continuous values have been scaled for use with an ANN (by dividing by 200).


Attribute Information:

Given is the attribute name, attribute type, the measurement unit and a brief description. The number of rings is the value to predict: either as a continuous value or as a classification problem. 

Name / Data Type / Measurement Unit / Description 
----------------------------- 
Sex / nominal / -- / M, F, and I (infant) 
Length / continuous / mm / Longest shell measurement 
Diameter	/ continuous / mm / perpendicular to length 
Height / continuous / mm / with meat in shell 
Whole weight / continuous / grams / whole abalone 
Shucked weight / continuous	/ grams / weight of meat 
Viscera weight / continuous / grams / gut weight (after bleeding) 
Shell weight / continuous / grams / after being dried 
Rings / integer / -- / +1.5 gives the age in years 

In [312]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [313]:
columns=['Sex','Length','Diameter','Height','Whole weight','Shucked weight','Viscera weight','Shell weight','Rings']
abalone_data=pd.read_csv('C:\\Users\\Lenovo\\ML\\EX\\abalone\\abalone.data',names=columns)
abalone_data.head()

Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7


In [345]:
abalone_data.describe()

Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
count,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0
mean,0.523992,0.407881,0.139516,0.828742,0.359367,0.180594,0.238831,9.933684
std,0.120093,0.09924,0.041827,0.490389,0.221963,0.109614,0.139203,3.224169
min,0.075,0.055,0.0,0.002,0.001,0.0005,0.0015,1.0
25%,0.45,0.35,0.115,0.4415,0.186,0.0935,0.13,8.0
50%,0.545,0.425,0.14,0.7995,0.336,0.171,0.234,9.0
75%,0.615,0.48,0.165,1.153,0.502,0.253,0.329,11.0
max,0.815,0.65,1.13,2.8255,1.488,0.76,1.005,29.0


In [314]:
''' #Detecting null numbers 
cnt=0
for names in columns:
    for row in abalone_data[names]:
                            #int(row)
                            try:
                                abalone_data.loc[cnt, 'Sex']=np.nan
                            except ValueError:
                                pass
                                cnt+=1
                                        
                                              
        
print(cnt)*/'''

" #Detecting null numbers \ncnt=0\nfor names in columns:\n    for row in abalone_data[names]:\n                            #int(row)\n                            try:\n                                abalone_data.loc[cnt, 'Sex']=np.nan\n                            except ValueError:\n                                pass\n                                cnt+=1\n                                        \n                                              \n        \nprint(cnt)*/"

In [315]:
print (abalone_data.isnull().sum())

Sex               0
Length            0
Diameter          0
Height            0
Whole weight      0
Shucked weight    0
Viscera weight    0
Shell weight      0
Rings             0
dtype: int64


In [316]:
#abalone_data['Sex']=abalone_data.fillna(abalone_data['Sex'].max)

In [352]:


correlationMatrix = abalone_data.corr() #computes correlation between 2 columns 
correlationMatrix

Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
Length,1.0,0.986812,0.827554,0.925261,0.897914,0.903018,0.897706,0.55672
Diameter,0.986812,1.0,0.833684,0.925452,0.893162,0.899724,0.90533,0.57466
Height,0.827554,0.833684,1.0,0.819221,0.774972,0.798319,0.817338,0.557467
Whole weight,0.925261,0.925452,0.819221,1.0,0.969405,0.966375,0.955355,0.54039
Shucked weight,0.897914,0.893162,0.774972,0.969405,1.0,0.931961,0.882617,0.420884
Viscera weight,0.903018,0.899724,0.798319,0.966375,0.931961,1.0,0.907656,0.503819
Shell weight,0.897706,0.90533,0.817338,0.955355,0.882617,0.907656,1.0,0.627574
Rings,0.55672,0.57466,0.557467,0.54039,0.420884,0.503819,0.627574,1.0


In [317]:
X=abalone_data.iloc[:,:-1].values
y=abalone_data.iloc[:,8].values

In [318]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder=LabelEncoder()
X[:,0]=labelencoder.fit_transform(X[:,0])
onehotencoder=OneHotEncoder(categorical_features=[3])
X=onehotencoder.fit_transform(X).toarray()
X=X[:,1:]

In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.


In [320]:
'''#normalize X
from sklearn import preprocessing
normalized_X = preprocessing.normalize(X)
normalized_X'''

array([[0.        , 0.92375577, 0.21015444, ..., 0.10369159, 0.04664967,
        0.06928168],
       [0.        , 0.96892041, 0.16956107, ..., 0.04820379, 0.02349632,
        0.03391221],
       [0.        , 0.        , 0.51832817, ..., 0.25085128, 0.13838384,
        0.20537531],
       ...,
       [0.        , 0.78919743, 0.23675923, ..., 0.20736163, 0.11344713,
        0.1215364 ],
       [0.        , 0.        , 0.41560791, ..., 0.35310048, 0.17355787,
        0.19683191],
       [0.        , 0.63597828, 0.22577229, ..., 0.30065873, 0.11972291,
        0.15740462]])

In [346]:
#splitting the data set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.35, random_state = 42)
y_train = y_train.ravel()
y_test = y_test.ravel()

In [347]:
# Fitting Multiple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [348]:
#predicting the model
y_pred=regressor.predict(X_test)
regressor.score(X_train, y_train)

0.5279323258162505

In [302]:
#splitting the data set
for i in range(1,10):
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.30, random_state = i)
    y_train = y_train.ravel()
    y_test = y_test.ravel()
    regressor=LinearRegression()
    regressor.fit(X_train, y_train)
    y_pred=regressor.predict(X_test)
    print(regressor.score(X_train, y_train))
    print(regressor.score(X_test, y_test))

0.5291316595221202
0.49951611561610276
0.5121619241889455
0.5374185176345461
0.5324103448485628
0.4869339937368166
0.5315262418728792
0.4911487833213459
0.5396003129782733
0.4748528026403341
0.5056938806457199
0.5544773862354259
0.5173955321487402
0.5233358249394888
0.5344706252646849
0.48717992679379174
0.529067656477977
0.500110014529761


In [324]:
### STATSMODELS ###
import statsmodels.formula.api as sm
X = np.append(arr = np.ones((4177, 1)).astype(int), values = X, axis = 1)

In [325]:
X.shape

(4177, 9)

In [332]:


X_opt=X[:,[0,1,4,6,7,8,]]
regressor_stat=sm.OLS(endog =Y, exog = X_opt).fit()

regressor_stat.summary()
# print the R-squared value for the model
#regressor_stat.rsquared

0,1,2,3
Dep. Variable:,y,R-squared:,0.502
Model:,OLS,Adj. R-squared:,0.502
Method:,Least Squares,F-statistic:,842.4
Date:,"Sun, 12 May 2019",Prob (F-statistic):,0.0
Time:,07:37:21,Log-Likelihood:,-9358.4
No. Observations:,4177,AIC:,18730.0
Df Residuals:,4171,BIC:,18770.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,3.0565,0.250,12.224,0.000,2.566,3.547
x1,0.7917,2.277,0.348,0.728,-3.672,5.255
x2,14.4625,0.965,14.991,0.000,12.571,16.354
x3,-11.7660,0.468,-25.126,0.000,-12.684,-10.848
x4,0.9117,1.042,0.875,0.381,-1.130,2.954
x5,21.1096,0.692,30.519,0.000,19.754,22.466

0,1,2,3
Omnibus:,1065.43,Durbin-Watson:,1.347
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3165.687
Skew:,1.313,Prob(JB):,0.0
Kurtosis:,6.361,Cond. No.,76.7


In [333]:
#logistic

from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import cross_val_score




# instantiate a logistic regression model, and fit with X and y
model = LogisticRegression()
model = model.fit(X, Y)

# check the accuracy on the training set
model.score(X, Y)






0.2573617428776634

In [337]:
abalone_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
Sex               4177 non-null object
Length            4177 non-null float64
Diameter          4177 non-null float64
Height            4177 non-null float64
Whole weight      4177 non-null float64
Shucked weight    4177 non-null float64
Viscera weight    4177 non-null float64
Shell weight      4177 non-null float64
Rings             4177 non-null int64
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB


In [260]:
model2 = LogisticRegression()
model2.fit(X_train, y_train)

# predict class labels for the test set
predicted = model2.predict(X_test)
predicted

# generate class probabilities
probs = model2.predict_proba(X_test)
probs

# generate evaluation metrics
print(metrics.accuracy_score(y_test, predicted))
#print(metrics.roc_auc_score(y_test, probs[:,7]))


#print(metrics.confusion_matrix(y_test, predicted))
print(metrics.classification_report(y_test, predicted))







0.25598259608411894
              precision    recall  f1-score   support

         3.0       0.00      0.00      0.00         3
         4.0       0.00      0.00      0.00        21
         5.0       0.00      0.00      0.00        50
         6.0       0.20      0.24      0.22        84
         7.0       0.38      0.26      0.31       144
         8.0       0.24      0.44      0.31       179
         9.0       0.27      0.50      0.35       229
        10.0       0.22      0.31      0.26       213
        11.0       0.29      0.25      0.27       155
        12.0       0.00      0.00      0.00        88
        13.0       0.00      0.00      0.00        61
        14.0       0.00      0.00      0.00        37
        15.0       0.00      0.00      0.00        34
        16.0       0.00      0.00      0.00        21
        17.0       0.00      0.00      0.00        17
        18.0       0.00      0.00      0.00        15
        19.0       0.00      0.00      0.00         9
       

  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


In [200]:
# evaluate the model using 10-fold cross-validation
scores = cross_val_score(LogisticRegression(), X, Y, scoring='accuracy', cv=5)
scores, scores.mean()



(array([0.19630485, 0.21779859, 0.24822695, 0.29216152, 0.27446301,
        0.22541966, 0.24396135, 0.25365854, 0.27941176, 0.26419753]),
 0.24956037715178753)