# scikit-learn

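scikit-learn provides tools for many common data analysis and machine learning tasks, such as regression, classification, clustering, and preprocessing. It also ships with a few small example datasets. In this notebook, we load one of them and fit a simple linear regression.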

In [1]:
# The package is published on PyPI as scikit-learn; it is imported as sklearn
!pip install scikit-learn

import pandas as pd
import numpy as np
from sklearn import datasets as ds



In [2]:
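# Load the Boston Housing dataset
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires an older scikit-learn version
dataset = ds.load_boston()

# The result is a dictionary-like Bunch object; see the keys for details:
print(dataset.keys())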

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])


In [3]:
# The 'DESCR' key holds a description text for the whole dataset
print(dataset['DESCR'])

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

In [4]:
# The data (independent variables) are stored under the 'data' key
# The names of the independent variables are stored in the 'feature_names' key
# Let's use them to create a DataFrame object:
df = pd.DataFrame(data=dataset['data'], columns=dataset['feature_names'])
print(df.head())

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0   
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0   
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0   
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0   
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0   

   PTRATIO       B  LSTAT  
0     15.3  396.90   4.98  
1     17.8  396.90   9.14  
2     17.8  392.83   4.03  
3     18.7  394.63   2.94  
4     18.7  396.90   5.33  


In [5]:
# The dependent variable is stored separately
df_y = pd.DataFrame(data=dataset['target'], columns=['target'])
print(df_y.head())

   target
0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
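
Since `load_boston` is not available in recent scikit-learn releases, the same steps can be sketched on a dataset that is still bundled with the library. The example below assumes `load_diabetes` as a stand-in (it is not the dataset used in this notebook) and uses the `as_frame=True` option, which returns the features and target as pandas objects directly. The variable names are illustrative.

```python
from sklearn import datasets as ds

# Stand-in dataset (an assumption for illustration, not the Boston data)
bunch = ds.load_diabetes(as_frame=True)

# With as_frame=True, 'data' is a DataFrame and 'target' is a Series
X_demo = bunch["data"]
y_demo = bunch["target"]

# 'frame' holds features and target together in a single DataFrame
full_frame = bunch["frame"]

print(X_demo.head())
print(y_demo.head())
```

Building the DataFrame by hand from `data` and `feature_names`, as done in the cells above, gives the same kind of result; `as_frame=True` only saves that step.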


In [6]:
# Now, let's build a linear regression model
from sklearn.linear_model import LinearRegression as LR

# First we create a linear regression object
regression = LR()

# Then we fit the model to the independent and dependent variables
regression.fit(df, df_y)

# We can obtain the R^2 score on the training data (more on this later)
print(regression.score(df, df_y))

0.7406426641094094
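
For a regressor, `score` returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot. As a minimal sketch of what that number means, the following recomputes it by hand. It uses synthetic data (an assumption made so the example runs on any scikit-learn version), and the names `X`, `y`, and `model` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration, not the Boston data)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)

# Residual sum of squares: squared errors of the model's predictions
y_pred = model.predict(X)
ss_res = np.sum((y - y_pred) ** 2)

# Total sum of squares: squared errors of always predicting the mean
ss_tot = np.sum((y - y.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot

# Both values should agree up to floating-point error
print(r2_manual, model.score(X, y))
```

A value close to 1 means the model explains most of the variance in the target. Because the score here is computed on the same data the model was fitted on, it tends to be optimistic.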


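Very often, we need to perform an operation on a single observation. scikit-learn estimators expect 2D input of shape (n_samples, n_features), whereas a single row taken from a DataFrame is one-dimensional. In that case, we have to reshape the data using numpy: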

In [7]:
# Consider a single observation 
so = df.loc[2, :]
print(so)

# Just the values of the observation, without the index metadata
print(so.values)

# Reshaping yields a 2D array with one row; the -1 tells numpy to infer the number of columns from the length of the observation
print(np.reshape(so.values, (1, -1)))

CRIM         0.02729
ZN           0.00000
INDUS        7.07000
CHAS         0.00000
NOX          0.46900
RM           7.18500
AGE         61.10000
DIS          4.96710
RAD          2.00000
TAX        242.00000
PTRATIO     17.80000
B          392.83000
LSTAT        4.03000
Name: 2, dtype: float64
[2.7290e-02 0.0000e+00 7.0700e+00 0.0000e+00 4.6900e-01 7.1850e+00
 6.1100e+01 4.9671e+00 2.0000e+00 2.4200e+02 1.7800e+01 3.9283e+02
 4.0300e+00]
[[2.7290e-02 0.0000e+00 7.0700e+00 0.0000e+00 4.6900e-01 7.1850e+00
  6.1100e+01 4.9671e+00 2.0000e+00 2.4200e+02 1.7800e+01 3.9283e+02
  4.0300e+00]]
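
To show why the reshape matters, here is a minimal sketch of predicting for a single observation. It again uses synthetic data and illustrative names (`X`, `y`, `model`, `single`), which are assumptions and not part of the notebook above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): the target is the row sum
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X.sum(axis=1)

model = LinearRegression().fit(X, y)

single = X[2]                      # 1D array of shape (4,)
single_2d = single.reshape(1, -1)  # 2D array of shape (1, 4)

# Passing the 1D array is rejected: predict expects (n_samples, n_features)
try:
    model.predict(single)
    rejected = False
except ValueError:
    rejected = True
print("1D input rejected:", rejected)

# The reshaped observation works and yields one prediction
prediction = model.predict(single_2d)
print(prediction.shape)
```

The same pattern applies to any estimator method that takes a feature matrix, for example `score` or `transform`.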


In [8]:
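# For two observations (.loc slices include both endpoints, so 2:3 selects rows 2 and 3).
# The values are already 2D here, so the reshape leaves them unchanged: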
so_2 = df.loc[2:3, :]
print(np.reshape(so_2.values, (2, -1)))

[[2.7290e-02 0.0000e+00 7.0700e+00 0.0000e+00 4.6900e-01 7.1850e+00
  6.1100e+01 4.9671e+00 2.0000e+00 2.4200e+02 1.7800e+01 3.9283e+02
  4.0300e+00]
 [3.2370e-02 0.0000e+00 2.1800e+00 0.0000e+00 4.5800e-01 6.9980e+00
  4.5800e+01 6.0622e+00 3.0000e+00 2.2200e+02 1.8700e+01 3.9463e+02
  2.9400e+00]]
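
As an alternative to reshaping with numpy, pandas can keep a selection two-dimensional if the row label is passed as a list. The sketch below uses a small hypothetical DataFrame, `df_demo`, to compare the resulting shapes.

```python
import pandas as pd

# Small hypothetical DataFrame for illustration
df_demo = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]}
)

# A scalar label returns a Series, so the values are 1D
row_series = df_demo.loc[2, :]

# A list with one label returns a DataFrame, so the values stay 2D
row_frame = df_demo.loc[[2], :]

# A label slice includes both endpoints and also returns a DataFrame
two_rows = df_demo.loc[2:3, :]

print(row_series.values.shape)
print(row_frame.values.shape)
print(two_rows.values.shape)
```

Which variant to prefer is a matter of taste; the list form avoids the explicit reshape and also keeps the column names.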


This concludes our quick run-through of some basic functionality of scikit-learn, used together with pandas and numpy. Later on, we will use more specialized functions and objects, but this is already enough to start experimenting with data.