<img src="./images/banner.png" width="800">

# Data Representation in Scikit-learn

Scikit-learn expects data in a small number of well-defined layouts, and knowing them is a prerequisite for using the library effectively. This section introduces the fundamental concepts of data representation in Scikit-learn. How you represent your data directly affects both the performance of your models and the efficiency of your workflows.


In Scikit-learn, data representation follows specific conventions that allow for:
1. Efficient computation
2. Consistency across different algorithms
3. Easy integration with other scientific Python libraries


Scikit-learn primarily uses two main types of data structures:

1. **Feature Matrix (X)**: Represents the input features
2. **Target Vector (y)**: Represents the output or target variable (for supervised learning)


In [1]:
import numpy as np
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X = iris.data  # Feature matrix
y = iris.target  # Target vector

print(f"Feature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")

Feature matrix shape: (150, 4)
Target vector shape: (150,)


💡 **Pro Tip:** Always check the shape of your data to ensure it matches Scikit-learn's expectations.


Scikit-learn is built on top of NumPy, and thus uses NumPy arrays as its primary data structure.


In [2]:
# Creating a feature matrix and target vector manually
X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
y = np.array([0, 1, 2])

print("Feature matrix:")
print(X)

Feature matrix:
[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [3]:
print("Target vector:")
print(y)

Target vector:
[0 1 2]


In Scikit-learn, data is typically organized following the sample-feature convention:

- Each row represents a sample (an instance or observation)
- Each column represents a feature (an attribute or variable)


🤔 **Why This Matters:** This convention allows for intuitive data manipulation and aligns with how most real-world datasets are structured.


```python
# Example of sample-feature convention (illustrative values)
# Columns: height (cm), weight (kg), age (years)
X = np.array([
    [170.0, 65.0, 30.0],  # sample 1
    [182.0, 80.0, 45.0],  # sample 2
    [158.0, 52.0, 23.0],  # sample 3
])
```


Understanding the dimensionality of your data is crucial:

- **2D arrays** for feature matrices (n_samples, n_features)
- **1D arrays** for target vectors (n_samples,)


In [4]:
from sklearn.datasets import make_classification

# Generate a random classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2)

print(f"Feature matrix dimensionality: {X.ndim}D")
print(f"Target vector dimensionality: {y.ndim}D")

Feature matrix dimensionality: 2D
Target vector dimensionality: 1D


❗️ **Important Note:** Scikit-learn expects 2D arrays for X, even when dealing with a single feature. You may need to reshape 1D arrays to 2D.


In [5]:
# Reshaping a 1D array to 2D
X_1d = np.array([1, 2, 3, 4, 5])
X_2d = X_1d.reshape(-1, 1)
print(f"Original shape: {X_1d.shape}, Reshaped: {X_2d.shape}")

Original shape: (5,), Reshaped: (5, 1)
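The same 2D requirement applies when predicting for a single sample. The sketch below uses a toy dataset invented for illustration (two features, target equal to their sum) and shows that one sample must be passed with shape `(1, n_features)`, which `reshape(1, -1)` produces.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data (made up for this example): target = feature 0 + feature 1
X_train = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]])
y_train = np.array([3.0, 3.0, 8.0, 7.0])

model = LinearRegression().fit(X_train, y_train)

# A single sample is a 1D vector of features; predict() needs it as one row
sample_1d = np.array([5.0, 6.0])        # shape (2,)
sample_2d = sample_1d.reshape(1, -1)    # shape (1, 2): one sample, two features

prediction = model.predict(sample_2d)
print(sample_2d.shape, prediction.shape)
```

Note the difference from `reshape(-1, 1)`, which would instead turn the vector into two samples with one feature each.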


Scikit-learn can handle various data types, but it's important to use the appropriate type for each situation:

- **Numerical data**: Typically float64 for continuous variables, int64 for discrete
- **Categorical data**: Encoded as integers or one-hot encoded
- **Text data**: Typically transformed into numerical features


We'll explore these data types in more detail in the following sections.


While NumPy arrays are the foundation, Scikit-learn also supports:

- **Sparse matrices**: For efficiently handling data with many zero values
- **Pandas DataFrames**: For labeled data with mixed types


These advanced data structures will be covered in later sections of this lecture.


By understanding these fundamental concepts of data representation in Scikit-learn, you'll be well-prepared to work with various datasets and machine learning algorithms effectively. In the next sections, we'll dive deeper into specific aspects of data representation and handling in Scikit-learn.

**Table of contents**<a id='toc0_'></a>    
- [Data Representation in Scikit-learn](#toc1_)    
  - [Feature Matrix (X)](#toc1_1_)    
  - [Target Vector (y)](#toc1_2_)    
  - [Relationship Between X and y](#toc1_3_)    
  - [Handling Mismatch Between X and y](#toc1_4_)    
  - [Best Practices](#toc1_5_)    
- [Handling Different Data Types](#toc2_)    
  - [Numerical Data](#toc2_1_)    
  - [Categorical Data](#toc2_2_)    
  - [Text Data](#toc2_3_)    
  - [Handling Mixed Data Types](#toc2_4_)    
- [Sparse Matrices and Their Use Cases](#toc3_)    
  - [What are Sparse Matrices?](#toc3_1_)    
  - [Types of Sparse Matrices in Scikit-learn](#toc3_2_)    
  - [Use Cases for Sparse Matrices](#toc3_3_)    
    - [Text Data Representation](#toc3_3_1_)    
    - [Large-Scale Machine Learning](#toc3_3_2_)    
  - [Working with Sparse Matrices in Scikit-learn](#toc3_4_)    
  - [Best Practices and Considerations](#toc3_5_)    
- [Working with Pandas DataFrames in Scikit-learn](#toc4_)    
  - [Converting DataFrames to Scikit-learn Compatible Format](#toc4_1_)    
  - [Using DataFrames Directly with Scikit-learn](#toc4_2_)    
  - [Feature Selection with DataFrames](#toc4_3_)    
  - [Pipelines with DataFrames](#toc4_4_)    
- [Summary](#toc5_)    


## <a id='toc1_'></a>[Data Representation in Scikit-learn](#toc0_)

Feature matrices and target vectors are the core data structures used in Scikit-learn for machine learning tasks. Understanding these structures is crucial for effectively preparing and manipulating data for your models.


<img src="./images/xy.webp" width="800">

### <a id='toc1_1_'></a>[Feature Matrix (X)](#toc0_)


The feature matrix, typically denoted as X, contains the input variables (features) for your machine learning model. Here are the key characteristics:

- 2D array (or matrix)
- Shape: (n_samples, n_features)
- Each row represents a single sample
- Each column represents a specific feature


In [7]:
import numpy as np

# Example feature matrix
X = np.array(
    [
        [1.0, 2.0, 3.0],  # Sample 1
        [4.0, 5.0, 6.0],  # Sample 2
        [7.0, 8.0, 9.0],  # Sample 3
    ]
)

print("Feature Matrix:")
print(X)

Feature Matrix:
[[1. 2. 3.]
 [4. 5. 6.]
 [7. 8. 9.]]


In [8]:
print(f"Shape: {X.shape}")

Shape: (3, 3)


💡 **Pro Tip:** Always check the shape of your feature matrix to ensure it matches the expected input format of Scikit-learn estimators.


Feature matrices can be created from various data sources:

1. **NumPy arrays**:

In [14]:
X = np.array([[1, 2], [3, 4], [5, 6]])
X

array([[1, 2],
       [3, 4],
       [5, 6]])

2. **Lists of lists**:

In [13]:
X = [[1, 2], [3, 4], [5, 6]]
X = np.array(X)  # Convert to NumPy array
X

array([[1, 2],
       [3, 4],
       [5, 6]])

3. **Pandas DataFrames**:

In [12]:
import pandas as pd

df = pd.DataFrame({"A": [1, 3, 5], "B": [2, 4, 6]})
X = df.to_numpy()  # Convert to NumPy array (preferred over .values)
X

array([[1, 2],
       [3, 4],
       [5, 6]])

### <a id='toc1_2_'></a>[Target Vector (y)](#toc0_)


The target vector, usually denoted as y, contains the output or dependent variable that your model aims to predict. Here are the key characteristics:

- 1D array
- Shape: (n_samples,)
- Each element corresponds to a sample in the feature matrix


In [16]:
# Example target vector
y = np.array([0, 1, 2])

print("Target Vector:")
print(y)

Target Vector:
[0 1 2]


In [17]:
print(f"Shape: {y.shape}")

Shape: (3,)


Target vectors can be of different types depending on the nature of the problem:

1. **Binary Classification**:
   - Two classes, typically represented as 0 and 1
   ```python
   y_binary = np.array([0, 1, 1, 0, 1])
   ```

2. **Multiclass Classification**:
   - More than two classes, represented as integers
   ```python
   y_multiclass = np.array([0, 1, 2, 1, 0, 2])
   ```

3. **Regression**:
   - Continuous values
   ```python
   y_regression = np.array([0.5, 1.2, 2.3, 1.8, 0.9])
   ```
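If you are unsure how Scikit-learn will interpret a given target, one way to check is the `type_of_target` helper from `sklearn.utils.multiclass`. The sketch below applies it to the three example targets above; the dictionary name `detected` is just an illustrative choice.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

y_binary = np.array([0, 1, 1, 0, 1])
y_multiclass = np.array([0, 1, 2, 1, 0, 2])
y_regression = np.array([0.5, 1.2, 2.3, 1.8, 0.9])

# Ask Scikit-learn how it classifies each target vector
detected = {
    "y_binary": type_of_target(y_binary),
    "y_multiclass": type_of_target(y_multiclass),
    "y_regression": type_of_target(y_regression),
}

for name, kind in detected.items():
    print(f"{name}: {kind}")
```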


### <a id='toc1_3_'></a>[Relationship Between X and y](#toc0_)


In supervised learning, there's a direct correspondence between samples in X and y:


In [18]:
X = np.array(
    [
        [1.0, 2.0],  # Features for sample 1
        [3.0, 4.0],  # Features for sample 2
        [5.0, 6.0],  # Features for sample 3
    ]
)

y = np.array(
    [
        0,  # Target for sample 1
        1,  # Target for sample 2
        1,  # Target for sample 3
    ]
)

print(f"Number of samples in X: {X.shape[0]}")
print(f"Number of samples in y: {y.shape[0]}")

Number of samples in X: 3
Number of samples in y: 3


🤔 **Why This Matters:** Ensuring that X and y have the same number of samples is crucial for proper model training and evaluation.


### <a id='toc1_4_'></a>[Handling Mismatch Between X and y](#toc0_)


A common error in machine learning is passing an X and a y that contain different numbers of samples. Scikit-learn detects this and raises a `ValueError`:


In [19]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Intentional mismatch
X = np.random.rand(100, 5)  # 100 samples, 5 features
y = np.random.randint(0, 2, 90)  # Only 90 targets

try:
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    model = LogisticRegression().fit(X_train, y_train)
except ValueError as e:
    print(f"Error: {e}")

Error: Found input variables with inconsistent numbers of samples: [100, 90]
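You can also run this check yourself before any splitting or training. A minimal sketch, assuming you want an early and explicit failure, uses `check_consistent_length` from `sklearn.utils`; the variable names `X_bad`, `y_bad`, and `lengths_match` are illustrative.

```python
import numpy as np
from sklearn.utils import check_consistent_length

rng = np.random.default_rng(0)
X_bad = rng.random((100, 5))          # 100 samples
y_bad = rng.integers(0, 2, size=90)   # only 90 targets

try:
    # Raises ValueError if the first dimensions differ
    check_consistent_length(X_bad, y_bad)
    lengths_match = True
except ValueError as err:
    lengths_match = False
    print(f"Caught before training: {err}")
```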


### <a id='toc1_5_'></a>[Best Practices](#toc0_)


1. **Consistency Check**: Always verify that the number of samples in X matches the length of y.
2. **Reshape if Necessary**: Use `reshape(-1, 1)` for a single-feature matrix and `reshape(1, -1)` for a single-sample input.
3. **Use Appropriate Data Types**: Typically float64 for X and int64 for y (in classification tasks).


In [20]:
# Ensuring correct shapes
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)

print(f"X shape: {X.shape}, y shape: {y.shape}")
print(f"X dtype: {X.dtype}, y dtype: {y.dtype}")

X shape: (100, 5), y shape: (100,)
X dtype: float64, y dtype: int64


By mastering the concepts of feature matrices and target vectors, you'll be well-equipped to prepare and manipulate data for various machine learning tasks in Scikit-learn. Remember, proper data representation is the foundation of successful model training and evaluation.

## <a id='toc2_'></a>[Handling Different Data Types](#toc0_)

Scikit-learn can work with various types of data, but it's crucial to understand how to properly represent and handle different data types for optimal model performance. In this section, we'll explore how to work with numerical, categorical, and text data in Scikit-learn.


### <a id='toc2_1_'></a>[Numerical Data](#toc0_)


Numerical data is the most straightforward type to work with in Scikit-learn. Many estimators convert numeric input to floating point (typically float64) internally, so float64 is a safe default for numerical features.


Continuous numerical data can be used directly in most Scikit-learn estimators.


In [58]:
import numpy as np
from sklearn.preprocessing import StandardScaler

# Continuous numerical data
X_continuous = np.array([[1.0, 2.5, 3.2], [4.1, 5.7, 6.3], [7.8, 8.9, 9.0]])

# Standardizing the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_continuous)

print("Original data:")
X_continuous

Original data:


array([[1. , 2.5, 3.2],
       [4.1, 5.7, 6.3],
       [7.8, 8.9, 9. ]])

In [57]:
print("Scaled data:")
X_scaled

Scaled data:


array([[-1.1871835 , -1.22474487, -1.25190792],
       [-0.07195052,  0.        ,  0.05626552],
       [ 1.25913401,  1.22474487,  1.1956424 ]])
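To make the scaled values less of a black box, the sketch below reproduces them by hand. Standardization computes $z = (x - \mu) / \sigma$ for each column, where `StandardScaler` uses the population standard deviation (`ddof=0`). The names `X_manual` and `X_sklearn` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_continuous = np.array([[1.0, 2.5, 3.2], [4.1, 5.7, 6.3], [7.8, 8.9, 9.0]])

# Manual standardization, column by column
col_mean = X_continuous.mean(axis=0)
col_std = X_continuous.std(axis=0)  # population std (ddof=0)
X_manual = (X_continuous - col_mean) / col_std

# The same transformation through Scikit-learn
X_sklearn = StandardScaler().fit_transform(X_continuous)

print(np.allclose(X_manual, X_sklearn))
print(X_sklearn.mean(axis=0).round(6), X_sklearn.std(axis=0).round(6))
```

After scaling, each column has mean 0 and standard deviation 1, up to floating-point error.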

Discrete numerical data (integers) can also be used directly, but sometimes scaling or encoding might be beneficial.


In [25]:
from sklearn.preprocessing import OneHotEncoder

# Discrete numerical data
X_discrete = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# One-hot encoding for discrete data
encoder = OneHotEncoder(sparse_output=True)
X_encoded = encoder.fit_transform(X_discrete)

In [56]:
print("Original discrete data:")
X_discrete

Original discrete data:


array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [55]:
print("One-hot encoded data:")
X_encoded

One-hot encoded data:


<3x9 sparse matrix of type '<class 'numpy.float64'>'
	with 9 stored elements in Compressed Sparse Row format>

💡 **Pro Tip:** Consider the nature of your discrete data. If it's ordinal, you might want to use it as is or apply scaling. If it's nominal, one-hot encoding might be more appropriate.


### <a id='toc2_2_'></a>[Categorical Data](#toc0_)


Categorical data requires special handling in Scikit-learn, as most algorithms expect numerical input.


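For ordinal data (categories with a meaningful order), you can use OrdinalEncoder. Unless you pass the intended order through its `categories` parameter, the encoder sorts the categories alphabetically, which is what happens in the example below.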


In [28]:
from sklearn.preprocessing import OrdinalEncoder

# Ordinal categorical data
X_ordinal = np.array(
    [["cold", "warm", "hot"], ["warm", "hot", "cold"], ["hot", "cold", "warm"]]
)

ordinal_encoder = OrdinalEncoder()
X_ordinal_encoded = ordinal_encoder.fit_transform(X_ordinal)

In [54]:
print("Original ordinal data:")
X_ordinal

Original ordinal data:


array([['cold', 'warm', 'hot'],
       ['warm', 'hot', 'cold'],
       ['hot', 'cold', 'warm']], dtype='<U4')

In [53]:
print("Encoded ordinal data:")
X_ordinal_encoded

Encoded ordinal data:


array([[0., 2., 1.],
       [2., 1., 0.],
       [1., 0., 2.]])
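In the output above, the codes follow alphabetical order (cold=0, hot=1, warm=2), which does not match the natural temperature order. A minimal sketch of how to supply the order explicitly is shown below; it uses a single-column array invented for the example, and `categories` takes one list per input column.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X_temps = np.array([["cold"], ["warm"], ["hot"], ["warm"]])

# Default behavior: categories sorted alphabetically
default_codes = OrdinalEncoder().fit_transform(X_temps)

# Explicit order: cold < warm < hot
ordered_encoder = OrdinalEncoder(categories=[["cold", "warm", "hot"]])
ordered_codes = ordered_encoder.fit_transform(X_temps)

print("alphabetical:", default_codes.ravel())
print("explicit:    ", ordered_codes.ravel())
```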

For nominal data (categories without inherent order), use OneHotEncoder.


In [34]:
# Nominal categorical data
X_nominal = np.array(
    [["red", "blue", "green"], ["blue", "green", "red"], ["green", "red", "blue"]]
)

onehot_encoder = OneHotEncoder(sparse_output=False)
X_nominal_encoded = onehot_encoder.fit_transform(X_nominal)

In [52]:
print("Original nominal data:")
X_nominal

Original nominal data:


array([['red', 'blue', 'green'],
       ['blue', 'green', 'red'],
       ['green', 'red', 'blue']], dtype='<U5')

In [51]:
print("One-hot encoded nominal data:")
X_nominal_encoded

One-hot encoded nominal data:


array([[0., 0., 1., 1., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 1., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 1., 1., 0., 0.]])

❗️ **Important Note:** Be cautious with one-hot encoding for high-cardinality categorical features, as it can lead to the curse of dimensionality.


### <a id='toc2_3_'></a>[Text Data](#toc0_)


Text data requires transformation into numerical features before it can be used in Scikit-learn models.


The most common approach is to use CountVectorizer or TfidfVectorizer.


In [37]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Sample text data
text_data = [
    "The quick brown fox",
    "jumps over the lazy dog",
    "The lazy dog sleeps"
]

# Using CountVectorizer
count_vectorizer = CountVectorizer()
X_count = count_vectorizer.fit_transform(text_data)

# Using TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(text_data)

In [49]:
print("Count Vectorizer:")
X_count.toarray()


Count Vectorizer:


array([[1, 0, 1, 0, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 1, 1, 0, 0, 1],
       [0, 1, 0, 0, 1, 0, 0, 1, 1]])

In [48]:
print("TF-IDF Vectorizer:")
X_tfidf.toarray()


TF-IDF Vectorizer:


array([[0.54645401, 0.        , 0.54645401, 0.        , 0.        ,
        0.        , 0.54645401, 0.        , 0.32274454],
       [0.        , 0.40619178, 0.        , 0.53409337, 0.40619178,
        0.53409337, 0.        , 0.        , 0.31544415],
       [0.        , 0.4804584 , 0.        , 0.        , 0.4804584 ,
        0.        , 0.        , 0.63174505, 0.37311881]])

🤔 **Why This Matters:** Choosing the right vectorization method can significantly impact your model's performance on text data.


### <a id='toc2_4_'></a>[Handling Mixed Data Types](#toc0_)


In real-world scenarios, you often encounter datasets with mixed data types. Here's an approach to handle such cases:


In [40]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [41]:
# Sample mixed-type data
data = pd.DataFrame({
    'age': [30, 40, 50, 60],
    'income': [50000, 60000, np.nan, 80000],
    'gender': ['M', 'F', 'M', 'F'],
    'category': ['A', 'B', 'A', 'C']
})

In [42]:
# Define preprocessing steps for different types of features
numeric_features = ["age", "income"]
categorical_features = ["gender", "category"]

numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

In [44]:
# Fit and transform the data
X_processed = preprocessor.fit_transform(data)

In [45]:
print("Processed data shape:", X_processed.shape)

Processed data shape: (4, 7)


In [47]:
print("Processed data:")
X_processed

Processed data:


array([[-1.34164079, -1.14707867,  0.        ,  1.        ,  1.        ,
         0.        ,  0.        ],
       [-0.4472136 , -0.22941573,  1.        ,  0.        ,  0.        ,
         1.        ,  0.        ],
       [ 0.4472136 , -0.22941573,  0.        ,  1.        ,  1.        ,
         0.        ,  0.        ],
       [ 1.34164079,  1.60591014,  1.        ,  0.        ,  0.        ,
         0.        ,  1.        ]])

This approach uses ColumnTransformer to apply a different preprocessing pipeline to each group of columns: imputation and scaling for the numeric columns, imputation and one-hot encoding for the categorical ones.
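Instead of listing column names by hand, columns can also be picked by dtype. The following is a sketch of that variation using `make_column_selector` from `sklearn.compose`; treating every non-numeric column as categorical is an assumption made for this example, not a general rule.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

data = pd.DataFrame({
    "age": [30, 40, 50, 60],
    "income": [50000, 60000, np.nan, 80000],
    "gender": ["M", "F", "M", "F"],
    "category": ["A", "B", "A", "C"],
})

preprocessor = ColumnTransformer(
    transformers=[
        # Numeric columns: fill missing values, then standardize
        ("num",
         make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
         make_column_selector(dtype_include=np.number)),
        # All remaining columns are treated as categorical here
        ("cat",
         OneHotEncoder(handle_unknown="ignore"),
         make_column_selector(dtype_exclude=np.number)),
    ]
)

X_processed = preprocessor.fit_transform(data)
print("Processed data shape:", X_processed.shape)
```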


To summarize, here are some best practices for handling different data types:
1. **Understand Your Data**: Always explore and understand the nature of your features before preprocessing.
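2. **Scale Numerical Data**: Many models, especially distance-based and gradient-based ones such as k-nearest neighbors, SVMs, and regularized linear models, perform better with scaled numerical features; tree-based models are largely insensitive to scaling.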
3. **Encode Categorical Data**: Choose between ordinal encoding and one-hot encoding based on the nature of the categories.
4. **Handle Text Carefully**: Consider the specifics of your text data when choosing a vectorization method.
5. **Use Pipelines**: Combine preprocessing steps with your model in a pipeline to ensure consistent application of transformations.


By mastering these techniques for handling different data types, you'll be well-equipped to prepare diverse datasets for machine learning tasks in Scikit-learn, ensuring that your models receive appropriately formatted input for optimal performance.

## <a id='toc3_'></a>[Sparse Matrices and Their Use Cases](#toc0_)

Sparse matrices are a crucial data structure in Scikit-learn, especially when dealing with high-dimensional data where most of the elements are zero. Understanding sparse matrices can significantly improve the efficiency of your machine learning workflows.


<img src="./images/sparse_dense.gif" width="800">

### <a id='toc3_1_'></a>[What are Sparse Matrices?](#toc0_)


🔑 **Key Concept:** A sparse matrix is a matrix in which most of the elements are zero. The central concept is to store only the non-zero elements, saving memory and computational resources.


In [60]:
import numpy as np
from scipy.sparse import csr_matrix

# Dense matrix with many zeros
dense_matrix = np.array([
    [1, 0, 0, 2],
    [0, 0, 3, 0],
    [4, 0, 0, 0]
])

# Convert to sparse matrix
sparse_matrix = csr_matrix(dense_matrix)
dense_matrix

array([[1, 0, 0, 2],
       [0, 0, 3, 0],
       [4, 0, 0, 0]])

In [61]:
print("Sparse matrix:")
sparse_matrix


Sparse matrix:


<3x4 sparse matrix of type '<class 'numpy.int64'>'
	with 4 stored elements in Compressed Sparse Row format>
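To see what "storing only the non-zero elements" means in practice, the sketch below inspects the three arrays that make up a CSR matrix, using the same small matrix as above.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense_matrix = np.array([
    [1, 0, 0, 2],
    [0, 0, 3, 0],
    [4, 0, 0, 0],
])
sparse_matrix = csr_matrix(dense_matrix)

# data:    the non-zero values, read row by row
# indices: the column index of each stored value
# indptr:  row i occupies data[indptr[i]:indptr[i + 1]]
print("data:   ", sparse_matrix.data)
print("indices:", sparse_matrix.indices)
print("indptr: ", sparse_matrix.indptr)
```

For a matrix this small, the bookkeeping arrays cost more than they save; the benefit appears when the matrix is large and mostly zeros.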

### <a id='toc3_2_'></a>[Types of Sparse Matrices in Scikit-learn](#toc0_)


Scikit-learn primarily uses two types of sparse matrix formats from SciPy:

1. **CSR (Compressed Sparse Row)**: Efficient for row slicing and matrix-vector products.
2. **CSC (Compressed Sparse Column)**: Efficient for column slicing and matrix-vector products.


<img src="./images/sparse-matrix.png" width="800">

<img src="./images/csr-coo.png" width="800">

In [62]:
from scipy.sparse import csr_matrix, csc_matrix

# CSR Matrix
csr_mat = csr_matrix(dense_matrix)

# CSC Matrix
csc_mat = csc_matrix(dense_matrix)

In [63]:
print("CSR Matrix:")
csr_mat

CSR Matrix:


<3x4 sparse matrix of type '<class 'numpy.int64'>'
	with 4 stored elements in Compressed Sparse Row format>

In [64]:
print("CSC Matrix:")
csc_mat

CSC Matrix:


<3x4 sparse matrix of type '<class 'numpy.int64'>'
	with 4 stored elements in Compressed Sparse Column format>

💡 **Pro Tip:** CSR is the format you will encounter most often in Scikit-learn: vectorizers and encoders return it, and it gives efficient row-wise access to samples.
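Converting between the two formats is a single method call. The sketch below, which rebuilds the small example matrix so it runs on its own, converts CSR to CSC and takes a row slice from one and a column slice from the other.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense_matrix = np.array([
    [1, 0, 0, 2],
    [0, 0, 3, 0],
    [4, 0, 0, 0],
])

csr_mat = csr_matrix(dense_matrix)
csc_mat = csr_mat.tocsc()  # same values, column-oriented storage

first_row = csr_mat[0].toarray()      # row slice, natural for CSR
third_col = csc_mat[:, 2].toarray()   # column slice, natural for CSC

print(csc_mat.format)
print(first_row)
print(third_col)
```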


### <a id='toc3_3_'></a>[Use Cases for Sparse Matrices](#toc0_)


Sparse matrices are particularly useful in several scenarios:


#### <a id='toc3_3_1_'></a>[Text Data Representation](#toc0_)


When working with text data, the bag-of-words or TF-IDF representations often result in sparse matrices.


In [65]:
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

print("Shape of sparse matrix:", X.shape)
print("Number of stored elements:", X.nnz)
print("Sparsity: {:.2f}%".format(100 * (1 - X.nnz / (X.shape[0] * X.shape[1]))))

Shape of sparse matrix: (4, 9)
Number of stored elements: 21
Sparsity: 41.67%


#### <a id='toc3_3_2_'></a>[Large-Scale Machine Learning](#toc0_)


For large datasets, sparse matrices can make certain algorithms feasible that would be impractical with dense representations.


In [97]:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Generate a large dense dataset (10,000 x 10,000 float64 values, about 800 MB), then sparsify it
X, y = make_classification(
    n_samples=10000,
    n_features=10000,
    n_informative=100,
    random_state=42,
    n_classes=2,
    weights=[0.9, 0.1],
)
X[X < 2.5] = 0  # Introduce sparsity
X_sparse = csr_matrix(X)

In [98]:

# Train logistic regression on sparse data
clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(X_sparse, y)

print("Model trained on sparse matrix")
print("Number of non-zero coefficients:", np.count_nonzero(clf.coef_))

Model trained on sparse matrix
Number of non-zero coefficients: 10000


In [72]:
# For comparison: fit the same model on the dense array X (same values, dense storage)
clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(X, y)


### <a id='toc3_4_'></a>[Working with Sparse Matrices in Scikit-learn](#toc0_)


Most Scikit-learn estimators automatically handle sparse input when appropriate. However, it's important to know how to create and manipulate sparse matrices.


To create a sparse matrix, you can use the `csr_matrix` or `dok_matrix` classes from SciPy.

In [73]:
from scipy.sparse import csr_matrix, dok_matrix

# From lists
row = [0, 0, 1, 2, 2, 2]
col = [0, 2, 2, 0, 1, 2]
data = [1, 2, 3, 4, 5, 6]
sparse_matrix = csr_matrix((data, (row, col)), shape=(3, 3))

# Using DOK (Dictionary of Keys) format
dok_sparse = dok_matrix((3, 3))
dok_sparse[0, 0] = 1
dok_sparse[0, 2] = 2
dok_sparse[1, 2] = 3
dok_sparse[2, 0] = 4
dok_sparse[2, 1] = 5
dok_sparse[2, 2] = 6

csr_from_dok = dok_sparse.tocsr()

In [74]:
print("CSR Matrix:")
sparse_matrix

CSR Matrix:


<3x3 sparse matrix of type '<class 'numpy.int64'>'
	with 6 stored elements in Compressed Sparse Row format>

In [75]:
print("CSR from DOK:")
csr_from_dok

CSR from DOK:


<3x3 sparse matrix of type '<class 'numpy.float64'>'
	with 6 stored elements in Compressed Sparse Row format>

You can perform element-wise operations on sparse matrices, but be cautious as some operations may convert the matrix to dense format internally, potentially causing memory issues.

In [77]:
# Element-wise operations
print("Element-wise multiplication:")
sparse_matrix.multiply(sparse_matrix)

Element-wise multiplication:


<3x3 sparse matrix of type '<class 'numpy.int64'>'
	with 6 stored elements in Compressed Sparse Row format>

In [79]:
# Matrix multiplication
print("Matrix multiplication:")
sparse_matrix.dot(sparse_matrix)


Matrix multiplication:


<3x3 sparse matrix of type '<class 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>

🤔 **Why This Matters:** Understanding sparse matrices allows you to handle large, high-dimensional datasets efficiently, enabling you to work with problems that would be intractable with dense representations.


### <a id='toc3_5_'></a>[Best Practices and Considerations](#toc0_)


To work best with sparse matrices, consider the following:
1. **Memory Efficiency**: Use sparse matrices when your data has many zero elements (typically > 90% zeros).
2. **Algorithm Compatibility**: Check if the algorithm you're using supports sparse input. Some algorithms may require dense input or have specialized sparse implementations.
3. **Conversion Costs**: Be aware of the computational cost of converting between sparse and dense representations. Avoid unnecessary conversions.
4. **Appropriate Sparse Format**: Choose the right sparse format (CSR, CSC, etc.) based on your access patterns and operations.
5. **Sparse-Aware Feature Selection**: When working with sparse data, consider using sparse-aware feature selection methods to maintain sparsity.


❗️ **Important Note:** Not all operations are efficient on sparse matrices. Operations that turn zeros into non-zero values (such as centering features) or an explicit `.toarray()` call produce a dense result, which can exhaust memory on large datasets.
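Feature scaling is a concrete case. Centering would fill the zeros with non-zero values, so `StandardScaler` refuses to center sparse input; scaling without centering (`with_mean=False`) keeps the matrix sparse. The sketch below uses a small matrix made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse
from sklearn.preprocessing import StandardScaler

X_sparse = csr_matrix(np.array([
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 0.0, 3.0, 0.0],
    [4.0, 0.0, 0.0, 0.0],
]))

try:
    # Default with_mean=True would destroy sparsity, so this raises ValueError
    StandardScaler().fit_transform(X_sparse)
    centering_failed = False
except ValueError as err:
    centering_failed = True
    print(f"Centering refused: {err}")

# Scaling by the standard deviation only keeps zeros as zeros
X_scaled = StandardScaler(with_mean=False).fit_transform(X_sparse)
print(issparse(X_scaled), X_scaled.nnz)
```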


By leveraging sparse matrices effectively, you can work with larger datasets, implement more complex models, and solve problems that would be impractical with dense representations. This knowledge is particularly valuable when dealing with text data, high-dimensional feature spaces, or large-scale machine learning tasks in Scikit-learn.

## <a id='toc4_'></a>[Working with Pandas DataFrames in Scikit-learn](#toc0_)

Pandas DataFrames are a popular and powerful data structure for handling structured data in Python. While Scikit-learn primarily works with NumPy arrays, it also provides seamless integration with Pandas DataFrames. Understanding how to effectively use DataFrames with Scikit-learn can greatly enhance your data preprocessing and machine learning workflow.


Pandas DataFrames are two-dimensional labeled data structures with columns of potentially different types, similar to a spreadsheet or SQL table.


In [80]:
import pandas as pd
import numpy as np

# Creating a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': ['a', 'b', 'c', 'd', 'e'],
    'C': [1.1, 2.2, 3.3, 4.4, 5.5],
    'D': [True, False, True, False, True]
})

df

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1 | a | 1.1 | True |
| 1 | 2 | b | 2.2 | False |
| 2 | 3 | c | 3.3 | True |
| 3 | 4 | d | 4.4 | False |
| 4 | 5 | e | 5.5 | True |


Advantages of using DataFrames with Scikit-learn:
1. **Labeled Data**: DataFrames provide column names, making it easier to keep track of features.
2. **Mixed Data Types**: DataFrames can handle multiple data types in different columns.
3. **Built-in Data Manipulation**: Pandas offers powerful data manipulation tools.
4. **Easy Data Inspection**: DataFrames provide convenient methods for data exploration.


### <a id='toc4_1_'></a>[Converting DataFrames to Scikit-learn Compatible Format](#toc0_)


Scikit-learn estimators accept array-like input such as NumPy arrays and SciPy sparse matrices. Here's how to convert the numerical columns of a DataFrame to a NumPy array explicitly:


In [81]:
from sklearn.preprocessing import StandardScaler

# Selecting only numeric columns
numeric_features = ['A', 'C']
X = df[numeric_features]

# Converting to NumPy array (to_numpy() is preferred over .values)
X_array = X.to_numpy()

# Scaling the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_array)

In [82]:
print("Original DataFrame:")
X


Original DataFrame:


|   | A | C |
|---|---|---|
| 0 | 1 | 1.1 |
| 1 | 2 | 2.2 |
| 2 | 3 | 3.3 |
| 3 | 4 | 4.4 |
| 4 | 5 | 5.5 |


In [83]:
print("Numpy Array:")
X_array

Numpy Array:


array([[1. , 1.1],
       [2. , 2.2],
       [3. , 3.3],
       [4. , 4.4],
       [5. , 5.5]])

In [84]:
print("Scaled Array:")
X_scaled

Scaled Array:


array([[-1.41421356, -1.41421356],
       [-0.70710678, -0.70710678],
       [ 0.        ,  0.        ],
       [ 0.70710678,  0.70710678],
       [ 1.41421356,  1.41421356]])

For categorical data, you can use OneHotEncoder.


In [86]:
from sklearn.preprocessing import OneHotEncoder

# Selecting categorical column
cat_features = ['B']
X_cat = df[cat_features]

# One-hot encoding
encoder = OneHotEncoder(sparse_output=False)
X_encoded = encoder.fit_transform(X_cat)

In [87]:
print("Original Categorical Data:")
X_cat

Original Categorical Data:


|   | B |
|---|---|
| 0 | a |
| 1 | b |
| 2 | c |
| 3 | d |
| 4 | e |


In [88]:
print("One-hot Encoded Data:")
X_encoded

One-hot Encoded Data:


array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

### <a id='toc4_2_'></a>[Using DataFrames Directly with Scikit-learn](#toc0_)


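Most Scikit-learn estimators and transformers accept DataFrames directly. By default, however, transformers return NumPy arrays, so the example below wraps the result back into a DataFrame to keep the column names.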


In [90]:
from sklearn.impute import SimpleImputer

In [91]:
# Creating a DataFrame with missing values
df_missing = pd.DataFrame(
    {"A": [1, 2, np.nan, 4], "B": [5, np.nan, 7, 8], "C": [9, 10, 11, np.nan]}
)

imputer = SimpleImputer(strategy="mean")
df_imputed = pd.DataFrame(imputer.fit_transform(df_missing), columns=df_missing.columns)


In [92]:
print("Original DataFrame with missing values:")
df_missing

Original DataFrame with missing values:


|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 9.0 |
| 1 | 2.0 | NaN | 10.0 |
| 2 | NaN | 7.0 | 11.0 |
| 3 | 4.0 | 8.0 | NaN |


In [93]:
print("Imputed DataFrame:")
df_imputed

Imputed DataFrame:


|   | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 9.0 |
| 1 | 2.0 | 6.666667 | 10.0 |
| 2 | 2.333333 | 7.0 | 11.0 |
| 3 | 4.0 | 8.0 | 10.0 |


💡 **Pro Tip:** When a transformer is fitted on a DataFrame, Scikit-learn records the column names in `feature_names_in_`, and many transformers expose `get_feature_names_out()`, making it easier to interpret results.


### <a id='toc4_3_'></a>[Feature Selection with DataFrames](#toc0_)


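Scikit-learn's feature selection methods accept DataFrames. The transformed output is a NumPy array by default, but the boolean mask returned by `get_support()` lets you recover the names of the selected columns: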


In [94]:
from sklearn.feature_selection import SelectKBest, f_regression

# Creating a DataFrame for feature selection
df_features = pd.DataFrame({
    'A': np.random.rand(100),
    'B': np.random.rand(100),
    'C': np.random.rand(100),
    'D': np.random.rand(100)
})
y = np.random.rand(100)  # Target variable

# Selecting top 2 features
selector = SelectKBest(f_regression, k=2)
X_selected = selector.fit_transform(df_features, y)

# Getting selected feature names
selected_features = df_features.columns[selector.get_support()].tolist()

In [95]:
print("Selected Features:", selected_features)
print("Selected Data Shape:", X_selected.shape)

Selected Features: ['A', 'B']
Selected Data Shape: (100, 2)


### <a id='toc4_4_'></a>[Pipelines with DataFrames](#toc0_)


Scikit-learn pipelines can be particularly powerful when working with DataFrames, allowing you to chain multiple preprocessing steps and models while maintaining feature names.


In [96]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Prepare data
X = pd.DataFrame({
    'numeric1': np.random.rand(100),
    'numeric2': np.random.rand(100),
    'categorical': np.random.choice(['A', 'B', 'C'], 100)
})
y = np.random.randint(0, 2, 100)  # Binary target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline
pipeline = Pipeline([
    ('preprocessor', ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), ['numeric1', 'numeric2']),
            ('cat', OneHotEncoder(drop='first'), ['categorical'])
        ])),
    ('classifier', LogisticRegression())
])

# Fit the pipeline
pipeline.fit(X_train, y_train)

# Make predictions
y_pred = pipeline.predict(X_test)

print("Pipeline Score:", pipeline.score(X_test, y_test))

Pipeline Score: 0.45


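Because both the features and the labels in this example are random, an accuracy near 0.5 is expected. The example illustrates the workflow, not predictive performance.


🤔 **Why This Matters:** Using pipelines with DataFrames lets you build a clean, reproducible workflow in which preprocessing and modeling are fitted and applied together as a single object.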


To work best with DataFrames, consider the following best practices:

1. **Consistent Data Types**: Ensure each DataFrame column holds a single, consistent data type (e.g., a column is entirely numeric or entirely categorical, not a mix of both).
2. **Handle Missing Data**: Address missing values before passing DataFrames to Scikit-learn estimators.
3. **Feature Names**: Leverage feature names in DataFrames for better interpretability of your models.
4. **Use ColumnTransformer**: For mixed data types, use ColumnTransformer to apply different preprocessing steps to different columns.
5. **Check Estimator Compatibility**: Verify whether the Scikit-learn estimator you're using can handle DataFrames directly or whether conversion to a NumPy array is necessary.
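The first two practices can be checked in a few lines. The sketch below uses a hypothetical DataFrame with missing values, inspects the column dtypes and missing counts, and fills the gaps with `SimpleImputer`. Median imputation is just one possible strategy, chosen here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with one missing value per column
df_raw = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 45.0],
    "income": [40.0, 50.0, np.nan, 70.0],
})

# Practice 1: confirm each column has the dtype you expect
print(df_raw.dtypes)

# Practice 2: count missing values per column before modeling
missing_counts = df_raw.isna().sum()
print(missing_counts)

# Replace each missing value with the median of its column
imputer = SimpleImputer(strategy="median")
df_filled = pd.DataFrame(
    imputer.fit_transform(df_raw),
    columns=df_raw.columns,
    index=df_raw.index,
)
print(df_filled)
```

In a real workflow the imputer would usually sit inside a pipeline so that the medians are learned from the training split only.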


❗️ **Important Note:** While many Scikit-learn functions work well with DataFrames, always check the documentation of specific estimators or functions to ensure compatibility and proper usage.


By effectively integrating Pandas DataFrames with Scikit-learn, you can create more intuitive, readable, and maintainable machine learning workflows. This approach combines the data manipulation power of Pandas with the machine learning capabilities of Scikit-learn, allowing you to handle complex datasets with ease and clarity.

## <a id='toc5_'></a>[Summary](#toc0_)

This lecture covered the essentials of data representation in Scikit-learn: how data should be structured, prepared, and handled for machine learning tasks. Let's recap the key points:

1. **Feature Matrices and Target Vectors**
   - Feature matrix (X): 2D array with shape (n_samples, n_features)
   - Target vector (y): 1D array with shape (n_samples,)
   - Consistency between X and y is crucial for proper model training

2. **Handling Different Data Types**
   - Numerical data: Can be used directly, often benefits from scaling
   - Categorical data: Requires encoding (ordinal or one-hot)
   - Text data: Needs vectorization (e.g., CountVectorizer, TfidfVectorizer)

3. **Sparse Matrices**
   - Efficient for high-dimensional data with many zero values
   - Commonly used in text processing and large-scale machine learning
   - CSR and CSC formats are most frequently used in Scikit-learn

4. **Working with Pandas DataFrames**
   - Provides labeled data structure with mixed data types
   - Can be used directly with some Scikit-learn estimators
   - Enables seamless integration of data manipulation and machine learning workflows


Proper data representation is fundamental to successful machine learning. It impacts model performance, computational efficiency, and interpretability of results.

To further enhance your skills in data representation with Scikit-learn:
1. Practice with diverse datasets to gain experience with different data types and structures
2. Experiment with various preprocessing techniques and observe their impact on model performance
3. Explore advanced feature engineering methods to create more informative representations of your data
4. Stay updated with Scikit-learn's documentation, as new features and optimizations are regularly added


Mastering these conventions lets you handle a wide range of data types and sizes, make good use of computational resources, and build more robust and interpretable models. Remember that the quality and appropriateness of your data representation is often as important as the choice of algorithm in determining the success of a machine learning project.