# SciPy Sparse Matrix

A matrix is sparse if many of its coefficients are zero. The interest in sparsity arises because its exploitation can lead to enormous computational savings and because many large matrix problems that occur in practice are sparse.

In [3]:
# dense to sparse
from numpy import array
from scipy.sparse import csr_matrix
# create dense matrix
A = array([[1, 0, 0, 1, 0, 0], [0, 0, 2, 0, 0, 1], [0, 0, 0, 2, 0, 0]])
print(A)

[[1 0 0 1 0 0]
 [0 0 2 0 0 1]
 [0 0 0 2 0 0]]


In [4]:
# convert to sparse matrix (CSR format)
S = csr_matrix(A)
print(S)

  (0, 0)	1
  (0, 3)	1
  (1, 2)	2
  (1, 5)	1
  (2, 3)	2


In [5]:
# reconstruct dense matrix
B = S.todense()
print(B)

[[1 0 0 1 0 0]
 [0 0 2 0 0 1]
 [0 0 0 2 0 0]]


NumPy does not provide a function to calculate the sparsity of a matrix.

Nevertheless, we can calculate it easily by finding the density of the matrix and subtracting it from one. The number of non-zero elements in a NumPy array is given by the `count_nonzero()` function, and the total number of elements is given by the array's `size` attribute. Array sparsity can therefore be calculated as

`sparsity = 1.0 - count_nonzero(A) / A.size`

In [6]:
# calculate sparsity
from numpy import array
from numpy import count_nonzero
# create dense matrix
A = array([[1, 0, 0, 1, 0, 0], [0, 0, 2, 0, 0, 1], [0, 0, 0, 2, 0, 0]])
print(A)

[[1 0 0 1 0 0]
 [0 0 2 0 0 1]
 [0 0 0 2 0 0]]


In [7]:
A.size

18

In [8]:
count_nonzero(A)

5

In [9]:
# calculate sparsity
sparsity = 1.0 - count_nonzero(A) / A.size
print(sparsity)

0.7222222222222222
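
The same figure can be read off a sparse matrix without going back to the dense array. The sketch below assumes the matrix `A` from the cells above and uses the `nnz` attribute, which counts stored entries (so an explicitly stored zero would be counted as well).

```python
import numpy as np
from scipy.sparse import csr_matrix

# Same dense matrix as in the cells above
A = np.array([[1, 0, 0, 1, 0, 0], [0, 0, 2, 0, 0, 1], [0, 0, 0, 2, 0, 0]])
S = csr_matrix(A)

# Total number of cells comes from the shape; nnz is the number of stored entries
n_rows, n_cols = S.shape
sparsity_from_sparse = 1.0 - S.nnz / (n_rows * n_cols)
print(sparsity_from_sparse)
```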


In [2]:
# w3school
import numpy as np
from scipy.sparse import csr_matrix # Compressed Sparse Row

arr = np.array([0, 0, 0, 0, 0, 1, 1, 0, 2])

print(csr_matrix(arr))

  (0, 5)	1
  (0, 6)	1
  (0, 8)	2


In [12]:
sparr = csr_matrix(arr)
type(sparr)

scipy.sparse._csr.csr_matrix

In [13]:
sparr.get_shape()

(1, 9)

In [18]:
sparr[0,8]

2

## ChatGPT's Primer on Working with SciPy Sparse Matrices

Below is a primer on working with SciPy sparse matrices (and the newer sparse arrays), with a focus on slicing, extracting values, and general tips for usage.

## 1. Overview of SciPy Sparse Matrices

For large, mostly-empty 2D data, **sparse matrices** are a memory-efficient alternative to NumPy dense arrays. Instead of storing every single element, sparse matrices store **only the nonzero entries** along with their positions. This makes computations on large matrices with few nonzero entries more feasible.

The primary sparse matrix library in Python is `scipy.sparse`. Since SciPy 1.8 it also provides **"sparse arrays"** (for example `csr_array` and `coo_array`), which follow NumPy array semantics rather than `numpy.matrix` semantics. Support beyond two dimensions is limited, so in practice most operations remain 2D-oriented.
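
To make the memory argument concrete, here is a rough sketch that compares the bytes a dense array would need with the bytes held by the three CSR arrays. The 1000 x 1000 identity matrix is an arbitrary choice for illustration, and the count ignores Python object overhead.

```python
from scipy import sparse

# 1000 x 1000 identity matrix: only the 1000 diagonal entries are stored
S = sparse.eye(1000, format="csr")

# Bytes a dense array of the same shape and dtype would occupy
dense_bytes = S.shape[0] * S.shape[1] * S.dtype.itemsize

# Bytes held by the CSR arrays (values, column indices, row pointers)
sparse_bytes = S.data.nbytes + S.indices.nbytes + S.indptr.nbytes

print(f"dense:  {dense_bytes:,} bytes")
print(f"sparse: {sparse_bytes:,} bytes")
```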

## 2. Common Sparse Formats

SciPy supports a variety of sparse formats, each optimized for different operations. The most common include:

1. **CSR** (Compressed Sparse Row; `csr_matrix`):

    - Stores row-wise slices of nonzero elements.
    - Great for row slicing and efficient arithmetic (e.g., matrix-vector multiplies).

2. **CSC** (Compressed Sparse Column; `csc_matrix`):

    - Stores column-wise slices of nonzero elements.
    - Great for column slicing and some advanced math operations.

3. **COO** (Coordinate; `coo_matrix`):

    - Stores a list of `(row, col, data)` tuples.
    - Easiest to build incrementally but not optimal for arithmetic; often used for **construction** then converted to CSR/CSC.

Other formats exist (e.g., **DOK, LIL**) but are less commonly used in production code, except for specialized tasks like incremental matrix building.
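
As a small illustration of incremental building, the sketch below fills a `lil_matrix` entry by entry and converts it to CSR once it is complete. The values mirror the 3x3 example used in the next section; the variable names are only illustrative.

```python
from scipy.sparse import lil_matrix

# Start from an empty 3x3 LIL matrix and assign entries one at a time
L = lil_matrix((3, 3), dtype=float)
L[0, 2] = 3.0
L[1, 0] = 4.0
L[2, 1] = 5.0

# Convert to CSR once construction is finished, before doing arithmetic
built_csr = L.tocsr()
print(built_csr.toarray())
```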

## 3. Creating Sparse Matrices

### 3.1 From Dense Arrays

In [1]:
import numpy as np
from scipy.sparse import csr_matrix

# Suppose we have a dense NumPy array:
dense = np.array([
    [0, 0, 3],
    [4, 0, 0],
    [0, 5, 0],
], dtype=float)

# Convert to a CSR matrix:
sparse_csr = csr_matrix(dense)
print(sparse_csr)
# Output (in CSR format):
#   (0, 2)    3.0
#   (1, 0)    4.0
#   (2, 1)    5.0


  (0, 2)	3.0
  (1, 0)	4.0
  (2, 1)	5.0


### 3.2 Directly via Format Constructors

In [2]:
from scipy.sparse import coo_matrix

rows = np.array([0, 1, 2])
cols = np.array([2, 0, 1])
data = np.array([3.0, 4.0, 5.0])

# Construct directly in COO format:
sparse_coo = coo_matrix((data, (rows, cols)), shape=(3,3))
sparse_coo

<3x3 sparse matrix of type '<class 'numpy.float64'>'
	with 3 stored elements in COOrdinate format>

In [3]:
sparse_coo.toarray()

array([[0., 0., 3.],
       [4., 0., 0.],
       [0., 5., 0.]])

In [4]:
sparse_coo.tocsr()

<3x3 sparse matrix of type '<class 'numpy.float64'>'
	with 3 stored elements in Compressed Sparse Row format>

In [5]:
print(sparse_coo.tocsr())

  (0, 2)	3.0
  (1, 0)	4.0
  (2, 1)	5.0


### 3.3 Converting Between Formats

In [None]:
sparse_csc = sparse_csr.tocsc()
sparse_coo = sparse_csr.tocoo()
# ... etc.


## 4. Basic Indexing and Slicing

When dealing with **2D SciPy sparse matrices** (e.g., `csr_matrix`, `csc_matrix`), indexing works similarly to NumPy, but with some important differences regarding performance and returned types.

### 4.1 Single Element Access

In [6]:
M = csr_matrix([
    [0, 2, 0],
    [4, 0, 6],
    [0, 8, 0],
], dtype=float)

val = M[1, 2]  # Access element at row=1, col=2 (zero-based)
print(val)     # 6.0


6.0


In [8]:
print(M)

  (0, 1)	2.0
  (1, 0)	4.0
  (1, 2)	6.0
  (2, 1)	8.0


In [9]:
type(val)

numpy.float64

Access by `M[i, j]` returns a **scalar** (not a 1x1 sparse matrix in newer SciPy versions).

**Caution**: Single-element access in sparse matrices can be expensive in some formats, since it may involve a search among the nonzero indices. If you need many random accesses, consider converting to CSR or CSC (depending on row/column access patterns), or even to a dense array if it fits in memory.

### 4.2 Row Slicing and Column Slicing

In [13]:
# Row slicing: M[ start_row : end_row, : ]
row_slice = M[0:2, :]  # rows 0 and 1, all columns
print("Row Slice:")
print(row_slice)
#   (0, 1)    2.0
#   (1, 0)    4.0
#   (1, 2)    6.0

# Column slicing: M[:, start_col : end_col]
col_slice = M[:, 1:3]  # all rows, columns 1 and 2
print("Column Slice:")
print(col_slice)
#   (0, 0)    2.0
#   (1, 1)    6.0
#   (2, 0)    8.0


Row Slice:
  (0, 1)	2.0
  (1, 0)	4.0
  (1, 2)	6.0
Column Slice:
  (0, 0)	2.0
  (1, 1)	6.0
  (2, 0)	8.0


**Note**: Here the slice keeps the sparse format of the original (`csr_matrix` → `csr_matrix`), but it is a new sparse object that holds only the sliced region, and its indices are renumbered relative to the slice. In the column slice above, original column 1 becomes column 0.
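
A quick way to check that a slice is independent of its source is to modify the slice and look at the original again. This is a minimal sketch using the same `M` as above; it overwrites an entry that is already stored, so the sparsity structure does not change.

```python
from scipy.sparse import csr_matrix

M = csr_matrix([[0, 2, 0], [4, 0, 6], [0, 8, 0]], dtype=float)
col_slice = M[:, 1:3]  # all rows, original columns 1 and 2

# Overwrite an entry that is already stored in the slice
col_slice[0, 0] = 99.0

print(col_slice[0, 0])  # value in the slice after the assignment
print(M[0, 1])          # corresponding entry in the original matrix
```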

In [14]:
col_slice

<3x2 sparse matrix of type '<class 'numpy.float64'>'
	with 3 stored elements in Compressed Sparse Row format>

In [15]:
M

<3x3 sparse matrix of type '<class 'numpy.float64'>'
	with 4 stored elements in Compressed Sparse Row format>

In [16]:
row_slice

<2x3 sparse matrix of type '<class 'numpy.float64'>'
	with 3 stored elements in Compressed Sparse Row format>

### 4.3 Mixed Indexing

In [19]:
# Range of rows, single column
print("Original M:")
print(M)
submatrix = M[0:3, 1]  # All rows, col=1
# This typically returns a (3x1) sparse column submatrix
print("M[0:3, 1]")
print(submatrix)
#   (0, 0)  2.0
#   (2, 0)  8.0
# (row 1 holds a zero in this column, so it has no stored entry)

# Range of rows, range of columns
submatrix = M[0:2, 1:3]
print("M[0:2, 1:3]")
print(submatrix)

Original M:
  (0, 1)	2.0
  (1, 0)	4.0
  (1, 2)	6.0
  (2, 1)	8.0
M[0:3, 1]
  (0, 0)	2.0
  (2, 0)	8.0
M[0:2, 1:3]
  (0, 0)	2.0
  (1, 1)	6.0


### 4.4 Fancy Indexing (Arrays of Indices)

In [23]:
rows_idx = np.array([2, 0])  # reorder row indices
cols_idx = np.array([1, 2])
submatrix = M[rows_idx[:, None], cols_idx]  
# shape will be (2, 2)

print("Original M")
print(M)
print("Submatrix:")
print(submatrix)
#   (0, 0)    8.0  # From original M[2, 1]
#   (1, 0)    2.0  # From original M[0, 1]


Original M
  (0, 1)	2.0
  (1, 0)	4.0
  (1, 2)	6.0
  (2, 1)	8.0
Submatrix:
  (0, 0)	8.0
  (1, 0)	2.0


In [21]:
print(submatrix.toarray())

[[8. 0.]
 [2. 0.]]


Be aware that “fancy” indexing can be significantly more expensive in sparse formats than contiguous slicing, as it may build intermediate structures or perform repeated lookups.
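
One way to sanity-check a fancy-indexing expression is to compare it against the dense equivalent on a small matrix. The sketch below assumes the same `M`, `rows_idx`, and `cols_idx` as above and also shows a chained form (rows first, then columns) that selects the same block.

```python
import numpy as np
from scipy.sparse import csr_matrix

M = csr_matrix([[0, 2, 0], [4, 0, 6], [0, 8, 0]], dtype=float)
rows_idx = np.array([2, 0])
cols_idx = np.array([1, 2])

# Broadcasting form used above: every selected row against every selected column
sub_broadcast = M[rows_idx[:, None], cols_idx]

# Chained form: pick the rows first, then the columns
sub_chained = M[rows_idx][:, cols_idx]

# Dense reference computed with NumPy's np.ix_
dense_ref = M.toarray()[np.ix_(rows_idx, cols_idx)]

print(sub_broadcast.toarray())
print(sub_chained.toarray())
print(dense_ref)
```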

## 5. Advanced Slicing (Efficiency Concerns)

- **CSR** is well-suited for slicing entire rows or contiguous row ranges (`M[i:j, :]`) because data is stored row-wise.
- **CSC** is well-suited for slicing entire columns or contiguous column ranges (`M[:, i:j]`) because data is stored column-wise.
- If you need to slice irregular subsets of rows/columns or use a lot of fancy indexing, consider:
    - Converting to a more convenient format (`.tolil()` for flexible indexing, or `.toarray()` if memory permits).
    - Using specialized indexing or iteration patterns (e.g., by iterating over `.nonzero()` or `.data`, `.indices`, `.indptr`).
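
The two ideas above can be sketched as follows: convert to CSC once before taking column slices, and walk the stored entries through a COO copy rather than indexing cell by cell. The matrix is the same small `M` used earlier, and the variable names are only illustrative.

```python
from scipy.sparse import csr_matrix

M = csr_matrix([[0, 2, 0], [4, 0, 6], [0, 8, 0]], dtype=float)

# Convert once, then take column slices from the CSC copy
M_csc = M.tocsc()
last_two_cols = M_csc[:, 1:3]
print(last_two_cols.toarray())

# Visit only the stored entries via a COO copy of the same data
coo = M.tocoo()
entries = sorted(zip(coo.row.tolist(), coo.col.tolist(), coo.data.tolist()))
print(entries)
```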

## 6. Inspecting Internal Representation

### 6.1 CSR Internals

For a **CSR** matrix `M`, the important attributes are:

- `M.data`: The nonzero values (1D array).
- `M.indices`: The column indices for each nonzero in `M.data`.
- `M.indptr`: An array of length `n_rows+1`; the entries of row `i` occupy `M.data[M.indptr[i]:M.indptr[i+1]]` (and the same range of `M.indices`).

Example:

In [None]:
M = csr_matrix([
    [0, 2, 0],
    [4, 0, 6],
    [0, 8, 0],
], dtype=float)

print(M.data)      # [2. 4. 6. 8.]
print(M.indices)   # [1 0 2 1]
print(M.indptr)    # [0 1 3 4]

# M.indptr = [0, 1, 3, 4]
# Row 0's nonzeros are in M.data[0:1]  -> (col=1, val=2)
# Row 1's nonzeros are in M.data[1:3]  -> (col=0, val=4), (col=2, val=6)
# Row 2's nonzeros are in M.data[3:4]  -> (col=1, val=8)
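
The row-by-row reading of `indptr` described in the comments above can be written as a loop. This is a minimal sketch over the same `M`; `row_entries` is just an illustrative name for the collected result.

```python
from scipy.sparse import csr_matrix

M = csr_matrix([[0, 2, 0], [4, 0, 6], [0, 8, 0]], dtype=float)

row_entries = {}
for i in range(M.shape[0]):
    # indptr[i]:indptr[i + 1] is the range of row i inside data and indices
    start, end = M.indptr[i], M.indptr[i + 1]
    cols = M.indices[start:end]
    vals = M.data[start:end]
    row_entries[i] = [(int(c), float(v)) for c, v in zip(cols, vals)]
    print(i, row_entries[i])
```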



### 6.2 COO Internals

For a **COO** matrix `cooM`, the attributes are:

- `cooM.row`: The row indices of each nonzero value.
- `cooM.col`: The column indices of each nonzero value.
- `cooM.data`: The nonzero values (1D).

COO is very handy for constructing a sparse matrix from scratch, but less efficient for repeated arithmetic or indexing.
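
One property that makes COO convenient for construction is that repeated `(row, col)` pairs are allowed and are summed when the matrix is converted to CSR or CSC. The sketch below uses made-up values with the position `(0, 1)` listed twice.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Position (0, 1) appears twice, with values 1.0 and 2.0
rows = np.array([0, 0, 1, 2])
cols = np.array([1, 1, 0, 2])
data = np.array([1.0, 2.0, 4.0, 5.0])

cooM = coo_matrix((data, (rows, cols)), shape=(3, 3))
print(cooM.row, cooM.col, cooM.data)  # the three parallel arrays
print(cooM.nnz)                       # stored entries, duplicates included

# Converting to CSR sums the duplicate entries
summed = cooM.tocsr()
print(summed.nnz)
print(summed.toarray())
```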

## 7. Converting Between Sparse and Dense

### 7.1 To a Dense NumPy Array

- `M.toarray()` returns a standard 2D NumPy array.
- `M.A` was historically a shorthand for `M.toarray()`; it has been deprecated in recent SciPy releases, so prefer the explicit `M.toarray()`.

In [24]:
dense_again = M.toarray()
print(dense_again)
# [[0. 2. 0.]
#  [4. 0. 6.]
#  [0. 8. 0.]]


[[0. 2. 0.]
 [4. 0. 6.]
 [0. 8. 0.]]


### 7.2 From a Dense NumPy Array

Use the constructor methods or `scipy.sparse` helper functions:

In [26]:
from scipy.sparse import coo_matrix, csc_matrix
csr_matrix(dense_again)
csc_matrix(dense_again)
coo_matrix(dense_again)

<3x3 sparse matrix of type '<class 'numpy.float64'>'
	with 4 stored elements in COOrdinate format>

**Caution**: Converting large sparse matrices to dense can cause out-of-memory issues if the matrix is huge. Always estimate memory usage first.
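
A rough way to estimate the dense footprint before converting is to multiply the shape by the item size. The helpers `dense_nbytes` and `to_dense_if_small` below are hypothetical names written for this sketch, not SciPy functions, and the 100 MiB limit is an arbitrary assumption.

```python
from scipy import sparse

def dense_nbytes(mat):
    """Bytes a dense array with the same shape and dtype would need."""
    n_rows, n_cols = mat.shape
    return n_rows * n_cols * mat.dtype.itemsize

def to_dense_if_small(mat, limit_bytes=100 * 1024**2):
    """Convert to a dense array only if the estimate stays under the limit."""
    needed = dense_nbytes(mat)
    if needed > limit_bytes:
        raise MemoryError(f"dense copy would need about {needed:,} bytes")
    return mat.toarray()

small = sparse.eye(4, format="csr")
big = sparse.eye(100_000, format="csr")  # cheap to store sparsely, huge when dense

print(dense_nbytes(big))
print(to_dense_if_small(small))

try:
    to_dense_if_small(big)
    refused = False
except MemoryError as err:
    refused = True
    print("refused:", err)
```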

# Miscellaneous 

In [None]:
from textblob import TextBlob

text = "The product is amazing and works great!"
analysis = TextBlob(text)
print(analysis.sentiment)

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
analyzer = vectorizer.build_analyzer()

analyzer("This is a text document to analyze.")

['this', 'is', 'text', 'document', 'to', 'analyze']

In [2]:
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

X = vectorizer.fit_transform(corpus)
X

<4x9 sparse matrix of type '<class 'numpy.int64'>'
	with 19 stored elements in Compressed Sparse Row format>

In [3]:
vectorizer.vocabulary_

{'this': 8,
 'is': 3,
 'the': 6,
 'first': 2,
 'document': 1,
 'second': 5,
 'and': 0,
 'third': 7,
 'one': 4}

In [4]:
vectorizer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

In [6]:
X.rows()

AttributeError: 'csr_matrix' object has no attribute 'rows'

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
X

<4x9 sparse matrix of type '<class 'numpy.float64'>'
	with 19 stored elements in Compressed Sparse Row format>

In [8]:
X.toarray()

array([[0.        , 0.43877674, 0.54197657, 0.43877674, 0.        ,
        0.        , 0.35872874, 0.        , 0.43877674],
       [0.        , 0.27230147, 0.        , 0.27230147, 0.        ,
        0.85322574, 0.22262429, 0.        , 0.27230147],
       [0.55280532, 0.        , 0.        , 0.        , 0.55280532,
        0.        , 0.28847675, 0.55280532, 0.        ],
       [0.        , 0.43877674, 0.54197657, 0.43877674, 0.        ,
        0.        , 0.35872874, 0.        , 0.43877674]])

In [9]:
corpus

['This is the first document.',
 'This is the second second document.',
 'And the third one.',
 'Is this the first document?']

In [10]:
vectorizer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

## Get Feature Names and Lookup Columns

In [12]:
# get feature names
feature_names = vectorizer.get_feature_names_out()
feature_names

array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

In [14]:
# find the index of a specific word
word = "second"
col_index = list(feature_names).index(word)
col_index

5

In [15]:
# access the value
X[1, col_index]

0.8532257361452784

In [17]:
X.get_feature_names_out()

AttributeError: 'csr_matrix' object has no attribute 'get_feature_names_out'

In [22]:
X.getrow(1).toarray()

array([[0.        , 0.27230147, 0.        , 0.27230147, 0.        ,
        0.85322574, 0.22262429, 0.        , 0.27230147]])

In [20]:
X[0, 1]

0.4387767428592343
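
Looking up a column with `list(feature_names).index(word)` scans the list each time; a dictionary such as `vectorizer.vocabulary_` gives the column directly. The sketch below avoids depending on scikit-learn by using a hand-built `vocabulary` dict (copied from the output above) and a small hand-built count matrix as stand-ins for `vectorizer.vocabulary_` and `X`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Stand-in for vectorizer.vocabulary_ (word -> column index)
vocabulary = {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1,
              'second': 5, 'and': 0, 'third': 7, 'one': 4}

# Stand-in for X: hand-built word counts for the first two corpus documents
counts = csr_matrix(np.array([
    [0, 1, 1, 1, 0, 0, 1, 0, 1],   # "This is the first document."
    [0, 1, 0, 1, 0, 2, 1, 0, 1],   # "This is the second second document."
]))

# Direct column lookup through the dictionary
col_index = vocabulary["second"]
print(counts[1, col_index])

# Map the stored entries of one row back to their words
index_to_word = {index: word for word, index in vocabulary.items()}
row = counts[1]
terms_in_row = {index_to_word[int(j)]: int(v) for j, v in zip(row.indices, row.data)}
print(terms_in_row)
```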

In scikit-learn’s **CountVectorizer**, you can easily retrieve:

1. The mapping from token to index (often called a “vocabulary”).
2. The mapping from index back to token (“reverse vocabulary” or index-to-word mapping).

Below is a quick example:

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one."
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# 1. Word to index
word_to_index = vectorizer.vocabulary_
print("Word to index mapping:")
print(word_to_index)

# 2. Index to word
index_to_word = {index: word for word, index in word_to_index.items()}
print("\nIndex to word mapping:")
print(index_to_word)


Word to index mapping:
{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}

Index to word mapping:
{8: 'this', 3: 'is', 6: 'the', 2: 'first', 1: 'document', 5: 'second', 0: 'and', 7: 'third', 4: 'one'}
