# Lesson 4: Efficient Text Data Representation with Sparse Matrices

Hello and welcome to this lesson on "Efficient Text Data Representation with Sparse Matrices". As you recall, in our previous lessons we transformed raw text data into numerical features using techniques such as Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). These transformations typically produce sparse matrices: matrices in which most entries are zero, stored so that only the non-zero values and their positions are kept. This makes them a memory-efficient way of storing high-dimensional data.

Let's break this down a bit. In the context of text data, each unique word across all documents is treated as a distinct feature. However, each document contains only a small subset of these words, so most entries in our feature matrix end up being 0s. The result is a sparse matrix.

We'll begin with a simple non-text matrix to illustrate sparse matrices and later connect this knowledge to our journey on text data transformation.

```python
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix, coo_matrix

# Simple example matrix
vectors = np.array([
    [0, 0, 2, 3, 0],
    [4, 0, 0, 0, 6],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 7, 0, 8, 0]
])
```

## Sparse Matrix Formats: CSR

In this section, we'll investigate how to handle sparse matrices in three formats: Compressed Sparse Row (CSR), Compressed Sparse Column (CSC), and Coordinate (COO).

We'll start with the CSR format, a common format for sparse matrices that is well suited to fast arithmetic operations and matrix-vector products.

```python
# CSR format
sparse_csr = csr_matrix(vectors)
print("Compressed Sparse Row (CSR) Matrix:\n", sparse_csr)
```

The output of the above code will be:

```
Compressed Sparse Row (CSR) Matrix:
   (0, 2)    2
  (0, 3)    3
  (1, 0)    4
  (1, 4)    6
  (4, 1)    7
  (4, 3)    8
```

Observe that the CSR output lists the non-zero values row by row, starting from the top. Each line pairs a (row, column) position with the value stored there; for example, `(0, 2)    2` tells us that the element in row 0, column 2 is 2 (indices are zero-based).
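The printed coordinates are a convenience view. Internally, SciPy's CSR format keeps three arrays. The sketch below inspects them for the same `vectors` matrix; `data`, `indices`, and `indptr` are SciPy attribute names, while `row4_cols` and `row4_vals` are illustrative names introduced here.

```python
import numpy as np
from scipy.sparse import csr_matrix

vectors = np.array([
    [0, 0, 2, 3, 0],
    [4, 0, 0, 0, 6],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 7, 0, 8, 0]
])
sparse_csr = csr_matrix(vectors)

# Non-zero values, stored row by row
print("data:   ", sparse_csr.data)     # [2 3 4 6 7 8]
# Column index of each stored value
print("indices:", sparse_csr.indices)  # [2 3 0 4 1 3]
# Row i occupies positions indptr[i]:indptr[i + 1] of data and indices
print("indptr: ", sparse_csr.indptr)   # [0 2 4 4 4 6]

# Recover the contents of row 4 from the three arrays
start, end = sparse_csr.indptr[4], sparse_csr.indptr[5]
row4_cols = sparse_csr.indices[start:end]  # columns 1 and 3
row4_vals = sparse_csr.data[start:end]     # values 7 and 8
print("row 4:", list(zip(row4_cols.tolist(), row4_vals.tolist())))
```

Notice that the two empty rows cost nothing in `data` or `indices`; they only repeat a value in `indptr`.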

## Sparse Matrix Formats: CSC

Next, let's convert our `vectors` matrix to the CSC format. Like CSR, this format underpins many of the operations we perform on sparse matrices, but it stores the non-zero entries column-wise, which makes it especially efficient for column slicing.

```python
# CSC format
sparse_csc = csc_matrix(vectors)
print("Compressed Sparse Column (CSC) Matrix:\n", sparse_csc)
```

The output of the above code will be:

```
Compressed Sparse Column (CSC) Matrix:
   (1, 0)    4
  (4, 1)    7
  (0, 2)    2
  (0, 3)    3
  (4, 3)    8
  (1, 4)    6
```

In this Compressed Sparse Column output, the non-zero entries are listed column by column. The two formats are closely related: the CSC representation of a matrix uses the same internal arrays as the CSR representation of its transpose.
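To make the relationship between CSC and CSR concrete, here is a small sketch that compares the internal arrays of a CSC matrix with those of the CSR matrix built from the transposed array, and then slices out one column. The names `csr_of_transpose`, `same_arrays`, and `col3` are illustrative.

```python
import numpy as np
from scipy.sparse import csc_matrix, csr_matrix

vectors = np.array([
    [0, 0, 2, 3, 0],
    [4, 0, 0, 0, 6],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 7, 0, 8, 0]
])
sparse_csc = csc_matrix(vectors)
csr_of_transpose = csr_matrix(vectors.T)

# In CSC, values are stored column by column and `indices` holds row indices
print("data:   ", sparse_csc.data)     # [4 7 2 3 8 6]
print("indices:", sparse_csc.indices)  # [1 4 0 0 4 1]
print("indptr: ", sparse_csc.indptr)   # [0 1 2 3 5 6]

# The same three arrays describe the CSR form of the transposed matrix
same_arrays = (
    np.array_equal(sparse_csc.data, csr_of_transpose.data)
    and np.array_equal(sparse_csc.indices, csr_of_transpose.indices)
    and np.array_equal(sparse_csc.indptr, csr_of_transpose.indptr)
)
print("Same arrays as CSR of the transpose:", same_arrays)  # True

# Column slicing, the operation CSC is well suited to
col3 = sparse_csc[:, 3].toarray().ravel()
print("Column 3:", col3)  # [3 0 0 0 8]
```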

## Sparse Matrix Formats: COO

Lastly, let's convert our example to the COO format, also known as the Coordinate List format. COO is another useful way to represent a sparse matrix and is simpler than CSR or CSC.

```python
# COO format
sparse_coo = coo_matrix(vectors)
print("Coordinate Format (COO) Matrix:\n", sparse_coo)
```

The output of the above code will be:

```
Coordinate Format (COO) Matrix:
   (0, 2)    2
  (0, 3)    3
  (1, 0)    4
  (1, 4)    6
  (4, 1)    7
  (4, 3)    8
```

In the COO format, or Coordinate format, the non-zero entries are represented by their own coordinates (row, column). Unlike CSC or CSR, the COO format can contain duplicate entries. This can be particularly useful when data is being accumulated in several passes and there might be instances where duplicate entries are generated. These duplicates are not immediately merged in the COO format, providing you with flexibility for subsequent processing like duplicate resolution.
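As a quick illustration of why COO is the simplest of the three, the sketch below reads the three parallel arrays SciPy keeps for it: `row`, `col`, and `data`. The `triplets` list is an illustrative helper, not part of SciPy.

```python
import numpy as np
from scipy.sparse import coo_matrix

vectors = np.array([
    [0, 0, 2, 3, 0],
    [4, 0, 0, 0, 6],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 7, 0, 8, 0]
])
sparse_coo = coo_matrix(vectors)

# One (row, col, value) triplet per stored entry
print("row: ", sparse_coo.row)   # [0 0 1 1 4 4]
print("col: ", sparse_coo.col)   # [2 3 0 4 1 3]
print("data:", sparse_coo.data)  # [2 3 4 6 7 8]

# Zip the arrays together to see the triplets explicitly
triplets = list(zip(sparse_coo.row.tolist(),
                    sparse_coo.col.tolist(),
                    sparse_coo.data.tolist()))
print(triplets)
```

Because each entry is an independent triplet, appending new entries (including repeats of an existing position) is just a matter of extending the three arrays.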

## Vectorized Operations: CSR and CSC

Sparse matrices are not just memory-efficient storage; we can also compute on them directly, without converting to a dense array first. CSR and CSC are the formats designed for arithmetic. COO is mainly a construction format: SciPy's documentation recommends converting a COO matrix to CSR or CSC before doing arithmetic or slicing.

Let's see this in practice when performing a multiplication operation.

```python
# Element-wise scaling on a CSR matrix (CSC works the same way)
weighted_csr = sparse_csr.multiply(0.5)
print("Weighted CSR:\n", weighted_csr.toarray())
```

The output of the code block above will be:

```
Weighted CSR:
 [[0.  0.  1.  1.5 0. ]
 [2.  0.  0.  0.  3. ]
 [0.  0.  0.  0.  0. ]
 [0.  0.  0.  0.  0. ]
 [0.  3.5 0.  4.  0. ]]
```
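Scaling is only one kind of operation. As a further sketch, here is a matrix-vector product on the CSR matrix using the `@` operator; the `weights` vector is a made-up example, chosen as all ones so that the product equals the row sums.

```python
import numpy as np
from scipy.sparse import csr_matrix

vectors = np.array([
    [0, 0, 2, 3, 0],
    [4, 0, 0, 0, 6],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 7, 0, 8, 0]
])
sparse_csr = csr_matrix(vectors)

# Hypothetical weight vector: one weight per column (feature)
weights = np.array([1, 1, 1, 1, 1])

# Sparse matrix times dense vector gives a dense 1-D result
row_sums = sparse_csr @ weights
print("Row sums:", row_sums)  # [ 5 10  0  0 15]

# The dense computation gives the same answer
print(np.array_equal(row_sums, vectors @ weights))  # True
```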

## Vectorized Operations: COO

Now let's perform the same multiplication starting from the COO matrix, converting it to CSR first with `tocsr()`.

```python
# Convert the COO matrix to CSR before doing the arithmetic
weighted_coo = sparse_coo.tocsr().multiply(0.5)
print("Weighted COO:\n", weighted_coo.toarray())
```

The output of the above code will be:

```
Weighted COO:
 [[0.  0.  1.  1.5 0. ]
 [2.  0.  0.  0.  3. ]
 [0.  0.  0.  0.  0. ]
 [0.  0.  0.  0.  0. ]
 [0.  3.5 0.  4.  0. ]]
```

## The Connection Between Sparse Matrices and NLP

After going through the concepts and code, you might ask: what does all this have to do with NLP? Well, remember when we transformed raw text data into either a Bag-of-Words or a TF-IDF representation in the previous lessons? Each unique word across all documents was treated as a distinct feature. Because the resulting feature representation is high-dimensional and mostly zeros, it is stored as a sparse matrix.
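To connect this back to text, here is a hand-rolled sketch of a Bag-of-Words count matrix built directly as a CSR matrix. It is an illustration only, not the vectorizer used in earlier lessons; the three toy documents and the names `docs`, `vocab`, `index`, and `bow` are assumptions made for this example.

```python
from scipy.sparse import csr_matrix

# Toy corpus (made up for illustration)
docs = ["the cat sat", "the dog barked", "a bird sang loudly"]

# Vocabulary: every unique word gets one column
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: j for j, word in enumerate(vocab)}

# Collect one (row, col, count) triplet per word occurrence
rows, cols, counts = [], [], []
for i, doc in enumerate(docs):
    for word in doc.split():
        rows.append(i)
        cols.append(index[word])
        counts.append(1)

# Repeated (row, col) pairs would be summed, giving word counts
bow = csr_matrix((counts, (rows, cols)), shape=(len(docs), len(vocab)))

print("Vocabulary:", vocab)
print("Shape:", bow.shape)          # (3, 9)
print("Stored entries:", bow.nnz)   # 10
density = bow.nnz / (bow.shape[0] * bow.shape[1])
print(f"Density: {density:.2f}")    # 0.37
```

Even in this tiny corpus most cells are zero, and the fraction of non-zero cells typically shrinks further as the vocabulary grows.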

Handling of sparse matrices becomes crucial in large NLP tasks, as they allow us to operate on large datasets while maintaining computational efficiency and optimal memory usage. Therefore, understanding these different formats of sparse matrices is an essential part of your feature engineering skills for text classification.
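To get a rough feel for the memory savings, the sketch below compares the bytes used by a dense array with the bytes used by the three arrays of its CSR form. The 1000 x 1000 matrix with one non-zero per row is a synthetic example chosen for this illustration, not real text data.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Synthetic matrix: 1000 x 1000 with a single non-zero in each row
n = 1000
dense = np.zeros((n, n))
dense[np.arange(n), np.arange(n)] = 1.0

sparse = csr_matrix(dense)

# Dense storage pays for every cell, zero or not
dense_bytes = dense.nbytes
# CSR storage pays only for stored values plus index bookkeeping
sparse_bytes = (sparse.data.nbytes
                + sparse.indices.nbytes
                + sparse.indptr.nbytes)

print("Dense bytes: ", dense_bytes)  # 8000000
print("Sparse bytes:", sparse_bytes)
print("Ratio: about", dense_bytes // sparse_bytes, "times smaller")
```

The exact sparse size depends on the index type SciPy picks on your platform, but it scales with the number of non-zero entries rather than with the full matrix size.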

## Lesson Summary

Congratulations! Today, you gained insight into sparse matrices and their different formats, and saw how they help store and operate on high-dimensional data, such as text records in NLP, efficiently. You also explored how vectorized operations behave on different sparse matrix formats. Understanding these formats is key to handling large datasets efficiently in NLP and other machine learning tasks. In the upcoming exercises, you'll get hands-on experience with these concepts, reinforcing your understanding further. Keep up the momentum and dive into practice!

## Switching from CSC to CSR Representation

Great job, Space Voyager! Now, try altering the given code to create a Compressed Sparse Row (CSR) matrix instead of a Compressed Sparse Column (CSC) matrix. Observe how the output's matrix representation changes when you switch formats.

Let's continue to explore the vastness of feature engineering!

```python
from scipy.sparse import csc_matrix
import numpy as np

vectors = np.array([
    [0, 0, 2, 3, 0],
    [4, 0, 0, 0, 6],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 7, 0, 8, 0]
])

# CSC format
sparse_csc = csc_matrix(vectors)
print("Compressed Sparse Column (CSC) Matrix:\n", sparse_csc)

```

To create a Compressed Sparse Row (CSR) matrix instead of a Compressed Sparse Column (CSC) matrix, you simply need to swap the constructor: import and call `csr_matrix` instead of `csc_matrix`. Below is the altered code snippet:

```python
from scipy.sparse import csr_matrix
import numpy as np

vectors = np.array([
    [0, 0, 2, 3, 0],
    [4, 0, 0, 0, 6],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 7, 0, 8, 0]
])

# CSR format
sparse_csr = csr_matrix(vectors)
print("Compressed Sparse Row (CSR) Matrix:\n", sparse_csr)
```

This change will cause the matrix to be stored in row-major order rather than column-major order. The output will reflect this by listing non-zero entries by their row and column indices in a row-major sequence. The output will look like this:

```
Compressed Sparse Row (CSR) Matrix:
   (0, 2)    2
  (0, 3)    3
  (1, 0)    4
  (1, 4)    6
  (4, 1)    7
  (4, 3)    8
```

This output indicates that non-zero values are listed starting from the first row, moving through each row sequentially. This is different from the CSC format, where entries are listed column by column.
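Switching formats changes the storage layout, not the matrix itself. The sketch below checks that both formats expand to the same dense array and shows the slice each one is best at; `same_content`, `row1`, and `col1` are illustrative names.

```python
import numpy as np
from scipy.sparse import csr_matrix

vectors = np.array([
    [0, 0, 2, 3, 0],
    [4, 0, 0, 0, 6],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 7, 0, 8, 0]
])

sparse_csr = csr_matrix(vectors)
sparse_csc = sparse_csr.tocsc()  # convert between formats directly
print(sparse_csr.format, sparse_csc.format)  # csr csc

# Both formats describe exactly the same matrix
same_content = np.array_equal(sparse_csr.toarray(), sparse_csc.toarray())
print("Same content:", same_content)  # True

# Row slicing suits CSR, column slicing suits CSC
row1 = sparse_csr[1].toarray().ravel()
col1 = sparse_csc[:, 1].toarray().ravel()
print("Row 1:   ", row1)  # [4 0 0 0 6]
print("Column 1:", col1)  # [0 0 0 0 7]
```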

## Creating a Coordinate Format Matrix with Duplicates

Onward, Celestial Traveler!

In this mission, we will be constructing a COO matrix. Importantly, this task has a twist: our matrix will include duplicate entries at the same positions, and you'll see how the COO format handles them, keeping them as separate entries until they are summed on conversion. Start by filling out the missing portions (____) of the Python script provided. You need to import `coo_matrix` and use it to generate the sparse matrix.

Let's conquer this mission!

```python
from ____.____ import coo_matrix
import numpy as np

# Manually defined data and indices
data = np.array([1, 1, 2, 2, 2])
row = np.array([0, 0, 1, 2, 2])
col = np.array([0, 0, 1, 2, 2])

# COO format
dup_sparse_coo = ____((data, (row, col)))
print("Coordinate Format (COO) Matrix with Duplicates:\n", dup_sparse_coo)

```

To complete the provided Python script, you need to fill in the missing parts to correctly import and use the `coo_matrix` from the SciPy library. Here's the corrected version of the script:

```python
from scipy.sparse import coo_matrix
import numpy as np

# Manually defined data and indices
data = np.array([1, 1, 2, 2, 2])
row = np.array([0, 0, 1, 2, 2])
col = np.array([0, 0, 1, 2, 2])

# COO format
dup_sparse_coo = coo_matrix((data, (row, col)))
print("Coordinate Format (COO) Matrix with Duplicates:\n", dup_sparse_coo)
```

In this script:
- The `coo_matrix` function is correctly imported from the `scipy.sparse` module.
- The `coo_matrix` is created by passing a tuple containing the `data` array and another tuple of `row` and `col` arrays. This specifies the values and their respective positions in the matrix.
- The print statement will output the COO matrix with duplicates.

When you run this script, the duplicate entries are not summed at creation: the COO matrix stores all five entries as given. The duplicates are summed when you convert the matrix to another format such as CSR or CSC, or to a dense array. To see the summed values, convert it to an array, for example:

```python
print("Consolidated COO Matrix:\n", dup_sparse_coo.toarray())
```

This will show the matrix as:

```
Consolidated COO Matrix:
 [[2 0 0]
 [0 2 0]
 [0 0 4]]
```

Here, the duplicates at positions (0,0) and (2,2) have been summed up to 2 and 4, respectively.
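If you want to merge the duplicates while staying in COO format, SciPy also provides an in-place `sum_duplicates()` method. A small sketch, where `stored_before` and `stored_after` are illustrative names:

```python
import numpy as np
from scipy.sparse import coo_matrix

data = np.array([1, 1, 2, 2, 2])
row = np.array([0, 0, 1, 2, 2])
col = np.array([0, 0, 1, 2, 2])

dup_sparse_coo = coo_matrix((data, (row, col)))

# All five entries are stored, duplicates included
stored_before = dup_sparse_coo.nnz
print("Stored entries before:", stored_before)  # 5

# Merge entries that share a (row, col) position, in place
dup_sparse_coo.sum_duplicates()

stored_after = dup_sparse_coo.nnz
print("Stored entries after: ", stored_after)   # 3
print("Merged values:", dup_sparse_coo.data)    # [2 2 4]
```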

## Performing Vectorized Operations on Sparse Matrices

Incredible work, Stellar Navigator!

Now let's change gears; take the given array, convert it into a Compressed Sparse Column (CSC) matrix, and perform a vectorized operation on this sparse matrix.

Fill in the blanks (____) to complete the matrix conversion and operation.

```python
from scipy.sparse import csc_matrix
import numpy as np

vectors = np.array([
    [0, 0, 2, 3, 0],
    [4, 0, 0, 0, 6],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 7, 0, 8, 0]
])

# Convert to CSC
____ = ____(vectors)

# Perform vectorized operation on the sparse CSC matrix
# Using / operator for division
div_csc = ____ / 3
print("Division Result:\n", ____.toarray())

```

To complete the provided Python script for converting an array into a Compressed Sparse Column (CSC) matrix and performing a division operation, you need to fill in the blanks appropriately. Here's the completed script:

```python
from scipy.sparse import csc_matrix
import numpy as np

vectors = np.array([
    [0, 0, 2, 3, 0],
    [4, 0, 0, 0, 6],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 7, 0, 8, 0]
])

# Convert to CSC
sparse_csc = csc_matrix(vectors)

# Perform vectorized operation on the sparse CSC matrix
# Using / operator for division
div_csc = sparse_csc / 3
print("Division Result:\n", div_csc.toarray())
```

In this script:
- The `csc_matrix` function is used to convert the `vectors` array into a CSC format sparse matrix, which is stored in `sparse_csc`.
- A division operation is performed on `sparse_csc` by dividing each non-zero element by 3. This is a vectorized operation that efficiently operates directly on the non-zero elements of the sparse matrix.
- The resulting matrix `div_csc` is converted back to a dense array using the `.toarray()` method for display purposes.

When you run this script, you will see the output of the division operation, which should look like this:

```
Division Result:
 [[0.         0.         0.66666667 1.         0.        ]
 [1.33333333 0.         0.         0.         2.        ]
 [0.         0.         0.         0.         0.        ]
 [0.         0.33333333 0.         0.         0.        ]
 [0.         2.33333333 0.         2.66666667 0.        ]]
```

This output shows the original non-zero elements of the matrix divided by 3, demonstrating how CSC format can be used for efficient storage and computation on sparse data.
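One way to convince yourself that the division really stays sparse is to check the result's format and its number of stored entries. The sketch below does that and compares the result with the dense computation.

```python
import numpy as np
from scipy.sparse import csc_matrix

vectors = np.array([
    [0, 0, 2, 3, 0],
    [4, 0, 0, 0, 6],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 7, 0, 8, 0]
])

sparse_csc = csc_matrix(vectors)
div_csc = sparse_csc / 3

# The result is still a sparse CSC matrix
print("Format:", div_csc.format)  # csc

# Zeros stay implicit: the number of stored entries is unchanged
print("Stored entries:", sparse_csc.nnz, "->", div_csc.nnz)  # 7 -> 7

# The values agree with dividing the dense array
print(np.allclose(div_csc.toarray(), vectors / 3))  # True
```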

## Creating CSR Matrix from Larger Array

Astounding work, Stellar Navigator!

Now, let's get hands-on with a slightly larger array. Your mission is to convert this array into a CSR matrix. Fill in the missing part of the code in the TODO comment with the necessary line of Python script. Keep going!

```python
from scipy.sparse import csr_matrix
import numpy as np

# New array
vectors = np.array([
    [4, 0, 0, 3, 0, 0],
    [7, 15, 0, 0, 9, 0],
    [6, 0, 0, 0, 0, 11],
    [0, 8, 0, 0, 0, 0],
    [0, 0, 16, 0, 2, 0],
    [5, 0, 0, 0, 0, 10],
])

# TODO: Convert the numpy array 'vectors' to a CSR format matrix.
# Store the resulting matrix in the 'sparse_csr' variable.
# Then, print the resulting sparse CSR matrix.

```

To complete the given Python script, you need to convert the `vectors` numpy array into a Compressed Sparse Row (CSR) matrix using the `csr_matrix` function from the `scipy.sparse` module. Here's the completed code snippet:

```python
from scipy.sparse import csr_matrix
import numpy as np

# New array
vectors = np.array([
    [4, 0, 0, 3, 0, 0],
    [7, 15, 0, 0, 9, 0],
    [6, 0, 0, 0, 0, 11],
    [0, 8, 0, 0, 0, 0],
    [0, 0, 16, 0, 2, 0],
    [5, 0, 0, 0, 0, 10],
])

# Convert the numpy array 'vectors' to a CSR format matrix
sparse_csr = csr_matrix(vectors)
print("Sparse CSR Matrix:\n", sparse_csr)
```

This code correctly initializes a CSR matrix from the `vectors` array and prints the resulting sparse matrix. The `print` statement will output the internal representation of the CSR matrix, showing the positions and values of non-zero elements. This output will look something like this:

```
  (0, 0)	4
  (0, 3)	3
  (1, 0)	7
  (1, 1)	15
  (1, 4)	9
  (2, 0)	6
  (2, 5)	11
  (3, 1)	8
  (4, 2)	16
  (4, 4)	2
  (5, 0)	5
  (5, 5)	10
```

This output indicates the row and column indices along with the corresponding values of non-zero elements in the `vectors` matrix, stored in a space-efficient CSR format.
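Once the matrix is in CSR form, row-level questions are cheap to answer. As a sketch, the code below counts the stored entries in each row from `indptr` and pulls out a single row; `nnz_per_row` and `row1` are illustrative names.

```python
import numpy as np
from scipy.sparse import csr_matrix

vectors = np.array([
    [4, 0, 0, 3, 0, 0],
    [7, 15, 0, 0, 9, 0],
    [6, 0, 0, 0, 0, 11],
    [0, 8, 0, 0, 0, 0],
    [0, 0, 16, 0, 2, 0],
    [5, 0, 0, 0, 0, 10],
])
sparse_csr = csr_matrix(vectors)

# Consecutive differences of indptr give the stored entries per row
nnz_per_row = np.diff(sparse_csr.indptr)
print("Non-zeros per row:", nnz_per_row)  # [2 3 2 1 2 2]
print("Total non-zeros:", sparse_csr.nnz)  # 12

# Extract one row and expand only that row to a dense array
row1 = sparse_csr[1].toarray().ravel()
print("Row 1:", row1)  # [ 7 15  0  0  9  0]
```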

## Subtracting from Non-Zero Elements via COO and CSR

Great job, Stellar Navigator! Let's wrap things up. Given an array, convert it to a Coordinate Format (COO) matrix, convert that to Compressed Sparse Row (CSR) format, and subtract 2 from the non-zero elements. Finally, display the result. Get ready to teleport to success!

```python
from scipy.sparse import coo_matrix
import numpy as np

# You have the following defined array
array = np.array([
    [0, 0, 2, 3, 0],
    [4, 0, 0, 0, 6],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 7, 0, 8, 0]
])

# TODO: Convert the array to a Coordinate Format (COO) Matrix.

# TODO: Convert the resulting COO matrix to a Compressed Sparse Row (CSR) format.

# TODO: Next, perform a Subtraction Operation on non-zero elements. Subtract 2 from the non-zero values.

# TODO: If necessary, convert the resultant matrix back to COO format.

# TODO: Finally, print the subtraction result

```

Subtracting a scalar from every element of a sparse matrix, zeros included, would turn each zero into a non-zero value and destroy the sparsity, so SciPy does not support it as a direct operation on sparse matrices. Since the task only asks us to change the non-zero elements, we can adjust the stored values directly. Here's how to perform the subtraction on the non-zero elements of the CSR matrix:

```python
from scipy.sparse import coo_matrix
import numpy as np

# You have the following defined array
array = np.array([
    [0, 0, 2, 3, 0],
    [4, 0, 0, 0, 6],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 7, 0, 8, 0]
])

# Convert the array to a Coordinate Format (COO) Matrix
coo = coo_matrix(array)

# Convert the resulting COO matrix to a Compressed Sparse Row (CSR) format
csr = coo.tocsr()

# Next, perform a Subtraction Operation on non-zero elements. Subtract 2 from the non-zero values.
# We manipulate the data array directly
csr.data = csr.data - 2

# Convert back to COO to view changes easily
result_coo = csr.tocoo()

# Finally, print the subtraction result
print("Subtraction Result in COO format:\n", result_coo)
```

### Steps:
1. **Convert to COO**: Start by converting the `array` into a COO matrix.
2. **Convert to CSR**: Change the COO matrix to CSR format, which is more efficient for row slicing and operations.
3. **Subtract from Non-zero Elements**: Directly modify the `data` attribute of the CSR matrix. This attribute holds the values of the non-zero elements, and by subtracting 2, we only affect these values, leaving the structure of zeros intact.
4. **Convert Back to COO**: For easier viewing and verification, convert the modified CSR back to COO format.
5. **Print the Result**: Display the resulting matrix, which now reflects the subtraction of 2 from each non-zero value.

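This approach ensures that only the stored non-zero values are modified, so the zeros of the original matrix remain implicit. One side effect is worth knowing: the entry at (0, 2) becomes 2 - 2 = 0 but is still stored explicitly, so the matrix keeps six stored entries. Calling `eliminate_zeros()` on the CSR matrix removes such entries if you need a clean representation.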
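The sketch below repeats the subtraction and then looks at what happens to the entry whose value drops to zero. It uses SciPy's `count_nonzero()` and `eliminate_zeros()` methods; `stored_before`, `true_nonzeros`, and `stored_after` are illustrative names.

```python
import numpy as np
from scipy.sparse import coo_matrix

array = np.array([
    [0, 0, 2, 3, 0],
    [4, 0, 0, 0, 6],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 7, 0, 8, 0]
])

csr = coo_matrix(array).tocsr()
csr.data = csr.data - 2  # stored values become [0 1 2 4 5 6]

# nnz counts stored entries, including the explicit zero at (0, 2)
stored_before = csr.nnz
# count_nonzero() counts only entries whose value is actually non-zero
true_nonzeros = csr.count_nonzero()
print("Stored:", stored_before, "Non-zero:", true_nonzeros)  # Stored: 6 Non-zero: 5

# Drop explicitly stored zeros, in place
csr.eliminate_zeros()
stored_after = csr.nnz
print("Stored after cleanup:", stored_after)  # 5
print(csr.toarray())
```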