# SEMANTIVE: DATA SCIENTIST RECRUITMENT TASK

##### Please download and load the  [abalone dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/).  

##### You can use information from [this](https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.names) file to add the proper headers to the columns. 

The whole task will be driven by supervised learning problem.
Let's define target variable as $Rings / 1.5$ (it rougly corresponds to abalone's age).  
We strongly encourage you use scikit-learn for the modeling tasks (but feel free to use different tools if you think they are appropriate).  

##### First 2 tasks are obligatory. From the tasks 3-5 you can pick and complete 2.

In [2]:
#A place for the imports
import scipy
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import Ridge
from sklearn.cluster import KMeans,MeanShift
from sklearn.model_selection import cross_val_score


***
# 1. Data exploration

##### Explore the data  and provide a short summary of your findings. Pay attention to the target variable.

In [3]:
#Looking at the data I can see 8 features and 1 target value which is the number of rings.
#Data is saved in a CSV formatted file.
#Sex values are a character, target values are an integer and the rest are floats/doubles.
#There are 4177 entries in the dataset.

***
# 2. Supervised learning

##### Prepare the data for the modeling.  

###### Choose 2 supervised learning algorithms, that you think are suitable for this problem. Describe shortly how these algorithms work. Use them on the data and describe the results.  
  
Note. that we're more interested in comprehensive description and explanation of your choice than in model scores, so we don't expect you to tune your model yet.

In [4]:
#Preparing the data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"

#getting the data

data = pd.read_csv(url)
data = np.split(data,[8],1)

features = data[0]
labels = data[1]

#Adding names to columns
features.columns = ['Sex','Lenght','Diameter','Height','Whole weight','Shucked weight','Viscera weight','Shell weight']
labels.columns = ['Rings']

#using 0 for male, 1 for female and 2 for infant
features['Sex'] = features['Sex'].map({'M' : 0 , 'F' : 1 , 'I' : 2})

#converting to proper types for use in cross validation
features = features.astype('float')
labels = labels.astype('int')

#Dividing data int training and test samples
#samples from 0-3499 are for training and samples from 3500 to 4177 are for testing
training_features = np.split(features,[3500])
test_features = training_features[1]
training_features = training_features[0]

training_labels = np.split(labels,[3500])
test_lables = training_labels[1]
training_labels = training_labels[0]

#converting data into vectors
training_features_matrix = np.matrix(training_features).astype(np.float)
training_labels_matrix = np.matrix(training_labels).astype(np.int)
test_features_matrix = np.matrix(test_features).astype(np.float)
test_lables_matrix = np.matrix(test_lables).astype(np.int)

#As a first supervised algorythm I have chosen Ridge regression. It's a linear regression algorithm that tries to make a line
#which tries to predict the target value given some features.
#Thanks to the delta value it's less prone to overfitting training data than for example least squares regression
#It also works well with a large number of features

ridge_model = Ridge()
ridge_model.fit(training_features_matrix,training_labels_matrix)
score = ridge_model.score(test_features,test_lables)
print('Ridge model predictions are: ' + repr(score*100) + '% correct')


#Second supervised algorythm I used is K nearest neighbours
#This time it's a classification algorithm, not a regression one. That means that now instead trying to predict a value
#we try to predict a category.
#That means that insted getting float point values like in Ridge, predictions will be integer values.

neighbours = KNeighborsClassifier()
neighbours.fit(training_features_matrix,training_labels_matrix)
score_neighbours = neighbours.score(test_features_matrix,test_lables_matrix)
print('K neares neighbours model predictions are: ' + repr(score_neighbours*100) + '% correct')

Ridge model predictions are: 42.9845143155875% correct
K neares neighbours model predictions are: 26.035502958579883% correct




***
# Other tasks - Pick 2 of them



# 3. Dimensionality reduction


##### What are the applications of the dimensionality reduction?   

###### Pick two algorithms that could be useful for exploration or supervised learning problem and apply them on the data. Describe what you've found. Provide a short description of the algorithms you've chosen.

Note: feature selection is also one type of dimensionality reduction.


# 4. Clustering

##### What is clustering used for? Name the popular types of clustering. Pick two clustering algorithms and run them on the dataset. Describe what you've found. Does it help with the supervised learning task?

In [11]:
#Clustering is an operation that tries to group individual pieces od data based on some characteristics.
#It's usually a type of unsupervised learning so it doesn't require labels.

#One of popular clustering algorithms is K-means clustering.
#It works by choosing K (which is the number of groups) and then placing group center points randomly and assigning data to the
#group which's center is the closest.
#Then we recalculate the position of the centers so that it's in the mean of all datapoint that were assigned to this group

means = KMeans()
means.fit(training_features_matrix,training_labels_matrix)
score_means = means.score(test_features_matrix,test_lables_matrix)

#Another popular clustering algorithm is Mean-Shift.
#It works by choosing some number of circles with radius R and then shifting them in the direction of most dense areas of data.
#Thanks to this it will automaticly discover the number of groups.

mean_shift = MeanShift()
mean_shift.fit(training_features_matrix)
predict_mean_shift = mean_shift.predict(test_features_matrix)

#It helps in categorizing the data


# 5. Hyperparameter selection and crossvalidation

##### We imagine that you did some modeling in 2nd task with the methods that have some tunable hyperparameters. If they don't, either find a versions of them that have that are tunable, or pick the different ones.

##### Tune the hyperparameters of your model using cross-validation. Does it make it better? Does it solve overfitting problems? Is cross-validation score worse than score that your model achieves on test set?

In [22]:
#Score of my Ridge model without changing the deafult hyperparameters value is 42.9845143155875%
#And score of K neares neighbours model is 26.035502958579883 %

score_cross_ridge = cross_val_score(ridge_model,features,labels,cv=5)
print('Ridge regression using cross validation results: {} Mean of the scores is: {}%'.format(score_cross_ridge,score_cross_ridge.mean()*100))
#The mean of the scores is slightly worse than the result we got before, but some individual scores are much better for example
#52.75%. The mean score of crossvalidation shows more accurately how the model would behave with different data samples.
                                    
score_cross_neighbours = cross_val_score(neighbours,features,labels,cv=5)
print('K neares neighbours using cross validation results: {} Mean of the scores is: {}%'.format(score_cross_neighbours,score_cross_neighbours.mean()*100))
#With K nearest neighbours algorythm the situation is simmilar. Some scores are higher than the previous score but the mean is lower.
#Crossvalidation score is indeed lower than what model achieves on the test set

Ridge regression using cross validation results: [0.4031994  0.21838992 0.5156747  0.52751759 0.45207689] Mean of the scores is: 42.33717008551297%
K neares neighbours using cross validation results: [0.19294118 0.2627824  0.22836538 0.24607961 0.25970874] Mean of the scores is: 23.797546296906198%


  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
  estimator.fit(X_train, y_train, **fit_params)
