Abalone (from Spanish Abulón) are shellfish, a genus of gastropods. Abalone are known by their colorful "pearlescent" inside shell. This is also called ear-shell, ormer in Guernsey, abalone in South Africa, and pāua in New Zealand.
Abalone Data Set :
Abstract: Predict the age of abalone from physical measurements.

Data Set Information:

Predicting the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability) may be required to solve the problem. 

From the original data examples with missing values were removed (the majority having the predicted value missing), and the ranges of the continuous values have been scaled for use with an ANN (by dividing by 200).


Attribute Information:

Given is the attribute name, attribute type, the measurement unit and a brief description. The number of rings is the value to predict: either as a continuous value or as a classification problem. 

Name / Data Type / Measurement Unit / Description 
----------------------------- 
Sex / nominal / -- / M, F, and I (infant) 
Length / continuous / mm / Longest shell measurement 
Diameter	/ continuous / mm / perpendicular to length 
Height / continuous / mm / with meat in shell 
Whole weight / continuous / grams / whole abalone 
Shucked weight / continuous	/ grams / weight of meat 
Viscera weight / continuous / grams / gut weight (after bleeding) 
Shell weight / continuous / grams / after being dried 
Rings / integer / -- / +1.5 gives the age in years 

In [422]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [423]:
#import data
columns=['Sex','Length','Diameter','Height','Whole weight','Shucked weight','Viscera weight','Shell weight','Rings']
abalone_data=pd.read_csv('C:\\Users\\Lenovo\\ML\\EX\\abalone\\abalone.data',names=columns)
abalone_data.head()

Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7


In [424]:
abalone_data.describe()

Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
count,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0
mean,0.523992,0.407881,0.139516,0.828742,0.359367,0.180594,0.238831,9.933684
std,0.120093,0.09924,0.041827,0.490389,0.221963,0.109614,0.139203,3.224169
min,0.075,0.055,0.0,0.002,0.001,0.0005,0.0015,1.0
25%,0.45,0.35,0.115,0.4415,0.186,0.0935,0.13,8.0
50%,0.545,0.425,0.14,0.7995,0.336,0.171,0.234,9.0
75%,0.615,0.48,0.165,1.153,0.502,0.253,0.329,11.0
max,0.815,0.65,1.13,2.8255,1.488,0.76,1.005,29.0


In [425]:
#checking for null values
print (abalone_data.isnull().sum())

Sex               0
Length            0
Diameter          0
Height            0
Whole weight      0
Shucked weight    0
Viscera weight    0
Shell weight      0
Rings             0
dtype: int64


In [426]:

#checking the correlation
correlationMatrix = abalone_data.corr() #computes correlation between 2 columns 
correlationMatrix

Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
Length,1.0,0.986812,0.827554,0.925261,0.897914,0.903018,0.897706,0.55672
Diameter,0.986812,1.0,0.833684,0.925452,0.893162,0.899724,0.90533,0.57466
Height,0.827554,0.833684,1.0,0.819221,0.774972,0.798319,0.817338,0.557467
Whole weight,0.925261,0.925452,0.819221,1.0,0.969405,0.966375,0.955355,0.54039
Shucked weight,0.897914,0.893162,0.774972,0.969405,1.0,0.931961,0.882617,0.420884
Viscera weight,0.903018,0.899724,0.798319,0.966375,0.931961,1.0,0.907656,0.503819
Shell weight,0.897706,0.90533,0.817338,0.955355,0.882617,0.907656,1.0,0.627574
Rings,0.55672,0.57466,0.557467,0.54039,0.420884,0.503819,0.627574,1.0


In [427]:
X=abalone_data.iloc[:,:-1].values
y=abalone_data.iloc[:,8].values

In [428]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder=LabelEncoder()
X[:,0]=labelencoder.fit_transform(X[:,0])
onehotencoder=OneHotEncoder(categorical_features=[3])
X=onehotencoder.fit_transform(X).toarray()
X=X[:,1:]

In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.


In [429]:
'''#normalize X
from sklearn import preprocessing
normalized_X = preprocessing.normalize(X)
normalized_X'''

'#normalize X\nfrom sklearn import preprocessing\nnormalized_X = preprocessing.normalize(X)\nnormalized_X'

In [430]:
#splitting the data set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.30, random_state = 4)
y_train = y_train.ravel()
y_test = y_test.ravel()

In [431]:
# Fitting Multiple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [432]:
#predicting the model
y_pred=regressor.predict(X_test)
regressor.score(X_train, y_train)

0.3290319906916449

In [421]:
#splitting the data set
for i in range(1,10):
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.30, random_state = i)
    y_train = y_train.ravel()
    y_test = y_test.ravel()
    regressor=LinearRegression()
    regressor.fit(X_train, y_train)
    y_pred=regressor.predict(X_test)
    print(regressor.score(X_train, y_train))
    print(regressor.score(X_test, y_test))

0.32489351661172783
0.29993380206628817
0.3148549860633636
0.32313448580929927
0.3328401697882677
0.27691761916790225
0.3290319906916449
0.2882533821242105
0.317297565945562
0.3158062728059582
0.3020867458901476
0.35228297239299866
0.32311427496949796
0.30088293642171515
0.3324808658680033
0.2818701372046377
0.323999389095039
0.29925779116158424


In [433]:
### STATSMODELS ###
import statsmodels.formula.api as sm
X = np.append(arr = np.ones((4177, 1)).astype(int), values = X, axis = 1)

In [434]:
X.shape

(4177, 9)

In [437]:


X_opt=X[:,[0,1,2,3,4,5,6,7,8]]
regressor_stat=sm.OLS(endog =Y, exog = X_opt).fit()

regressor_stat.summary()
# print the R-squared value for the model
#regressor_stat.rsquared

0,1,2,3
Dep. Variable:,newRings,R-squared:,0.318
Model:,OLS,Adj. R-squared:,0.316
Method:,Least Squares,F-statistic:,242.7
Date:,"Sun, 12 May 2019",Prob (F-statistic):,0.0
Time:,10:16:26,Log-Likelihood:,-2026.0
No. Observations:,4177,AIC:,4070.0
Df Residuals:,4168,BIC:,4127.0
Df Model:,8,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-0.0134,0.048,-0.279,0.780,-0.108,0.081
x1,-0.0294,0.394,-0.075,0.941,-0.801,0.742
x2,0.0014,0.007,0.187,0.852,-0.013,0.016
x3,-1.3357,0.324,-4.128,0.000,-1.970,-0.701
x4,1.6932,0.394,4.295,0.000,0.920,2.466
x5,1.1675,0.130,8.984,0.000,0.913,1.422
x6,-2.3125,0.146,-15.818,0.000,-2.599,-2.026
x7,-0.3636,0.231,-1.573,0.116,-0.817,0.090
x8,1.2429,0.201,6.197,0.000,0.850,1.636

0,1,2,3
Omnibus:,273.299,Durbin-Watson:,1.585
Prob(Omnibus):,0.0,Jarque-Bera (JB):,219.966
Skew:,0.478,Prob(JB):,1.7199999999999998e-48
Kurtosis:,2.407,Cond. No.,160.0


In [438]:
#logistic

from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import cross_val_score

'''Simple Logistic Regression Model
No of Classes : 2
1 - Rings > 10
0 - Rings <= 10'''
'''Creating New Target Variable '''
abalone_data['newRings'] = np.where(abalone_data['Rings'] > 10,1,0)

'''Learning Features and Predicting Features'''
X = abalone_data.drop(['newRings','Rings','Sex'], axis = 1)


Y=abalone_data['newRings']


# instantiate a logistic regression model, and fit with X and y
model = LogisticRegression()
model = model.fit(X, Y)

# check the accuracy on the training set
model.score(X, Y)






0.7711276035432129

In [439]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
model2 = LogisticRegression()
model2.fit(X_train, y_train)

# predict class labels for the test set
predicted = model2.predict(X_test)
predicted

# generate class probabilities
probs = model2.predict_proba(X_test)
probs

# generate evaluation metrics
print(metrics.accuracy_score(y_test, predicted))
#print(metrics.roc_auc_score(y_test, probs[:,7]))


#print(metrics.confusion_matrix(y_test, predicted))
print(metrics.classification_report(y_test, predicted))





0.7563451776649747
              precision    recall  f1-score   support

           0       0.79      0.87      0.83       923
           1       0.67      0.52      0.59       456

   micro avg       0.76      0.76      0.76      1379
   macro avg       0.73      0.70      0.71      1379
weighted avg       0.75      0.76      0.75      1379





In [440]:
# evaluate the model using 10-fold cross-validation
scores = cross_val_score(LogisticRegression(), X, Y, scoring='accuracy', cv=5)
scores, scores.mean()



(array([0.80741627, 0.73803828, 0.76167665, 0.7760479 , 0.74730539]),
 0.7660968971148613)

In [443]:
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors = 11)  # n_neighbors means k
X=abalone_data.iloc[:,1:-1].values
y=abalone_data.iloc[:,8].values
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
knn.fit(X_train, y_train)
prediction = knn.predict(X_test)

print("{} NN Score: {:.2f}%".format(2, knn.score(X_test, y_test)*100))

from sklearn.metrics import accuracy_score
scoreList = []
for i in range(80,120):
    knn2 = KNeighborsRegressor(n_neighbors = i)  # n_neighbors means k
    knn2.fit(X_train, y_train)
    scoreList.append(knn2.score(X_test, y_test))
    print(i,knn2.score(X_test, y_test))
    #print("Accuracy is ", accuracy_score(y_test,y_pred)*100,"% for K-Value:",i)
    

2 NN Score: 100.00%
80 0.9998868616027304
81 0.9998756553707333
82 0.9998772078627538
83 0.9998715880911324
84 0.9998611613638949
85 0.9998644089389124
86 0.9998644429071281
87 0.9998437333078354
88 0.9998396490312061
89 0.9998395094613038
90 0.9998430561040724
91 0.9998215603019293
92 0.9998242570465561
93 0.9998079387351242
94 0.9998120034087924
95 0.9998094057408027
96 0.9998087341071799
97 0.9998095234219467
98 0.9997994037097077
99 0.9998034356930958
100 0.9998030879949061
101 0.9997899451452894
102 0.9997868005658764
103 0.9997665224816166
104 0.9997709908475843
105 0.9997696857407755
106 0.9997635132318758
107 0.9997567521038029
108 0.9997499998794112
109 0.9997178887791857
110 0.9996905014757466
111 0.9996811613599824
112 0.9996797772761741
113 0.9996636096511841
114 0.9996362069893889
115 0.999630862377965
116 0.9996040846999417
117 0.9995914365425056
118 0.9995910375066103
119 0.9995738196462302


In [444]:
# Fitting Decision Tree Classification to the Training set
from sklearn.tree import DecisionTreeRegressor
X=abalone_data.iloc[:,1:-1].values
y=abalone_data.iloc[:,8].values
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
classifier = DecisionTreeRegressor( random_state = 20)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

print(cm)
print(classifier.score(X_train, y_train))
print(classifier.score(X_test, y_test))

[[923   0]
 [  0 456]]
1.0
1.0


In [445]:
# Gaussian Naive Bayes
from sklearn import datasets
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB

# Fit a Naive Bayes model to the data
model = GaussianNB()
# Fit the training model
model.fit(X_train,y_train)
# Predicted outcomes
predicted = model.predict(X_test)

# Actual Expected Outvomes
expected = y_test
print(metrics.accuracy_score(expected, predicted))

0.7933284989122552
