<a href="https://colab.research.google.com/github/tfindiamooc/mlp/blob/main/TextAnalysisClass1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lesson #1: Getting Started with Text Data - Preprocessing Basics in Scikit-learn

What to expect:

* **Top-Down & Code-First**: We'll jump into code immediately and then explain the concepts as we go. Students will see working code from the first few minutes.
* **Start Simple, Iterate**: We begin with very basic preprocessing and gradually add more steps.
* **Real-World Relevance (Implicit)**: We'll frame text data as something encountered in everyday applications (reviews, articles, etc.), and the practical exercises will build towards real-world text analysis tasks in later lessons.
* **Encourage Experimentation**: The lesson will include prompts for students to modify the code and observe the results.
* **Focus on Practical Skills**: By the end of this lesson, students will be able to write Python code using scikit-learn to perform basic text preprocessing.



## Introduction - What is Text Data and Why Preprocess?

* Briefly explain what text data is and why it's important in today's world (examples: online reviews, social media, news articles, documents).

* Highlight that raw text is messy and needs cleaning before machine learning models can understand it. Analogy: "Like cleaning ingredients before cooking!"

* Mention that in this lesson, we'll start with the cleaning part – preprocessing.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first sentence.",
    "This one is the second sentence.",
    "Sentence number three.",
    "Is this the first sentence?",
]

vectorizer = CountVectorizer()
vectorizer.fit(corpus) # "Learn" vocabulary from the corpus
X = vectorizer.transform(corpus) # "Transform" text to numbers

print("Vocabulary:")
print(vectorizer.get_feature_names_out()) # Show the words it learned

print("\nDocument-Term Matrix (Counts):")
print(X.toarray()) # Show the numerical representation of the text

Vocabulary:
['first' 'is' 'number' 'one' 'second' 'sentence' 'the' 'this' 'three']

Document-Term Matrix (Counts):
[[1 1 0 0 0 1 1 1 0]
 [0 1 0 1 1 1 1 1 0]
 [0 0 1 0 0 1 0 0 1]
 [1 1 0 0 0 1 1 1 0]]


Run the code and show the output.
* Explain `CountVectorizer` in simple terms: "*It's like a tool that counts words.*"

* **Vocabulary**: "Look at 'Vocabulary' - these are all the unique words CountVectorizer found in our text."

* **Document-Term Matrix**: "This matrix shows how many times each word appears in each sentence. Each row is a sentence, each column is a word from the vocabulary."

* Emphasize: "Machine learning models work with numbers, not text. `CountVectorizer` helps us convert text to numbers!"
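To make the "tool that counts words" idea concrete, here is a rough sketch of the same counting done by hand in plain Python. It is not scikit-learn's actual implementation; the helper name `tokenize` is hypothetical, and the regular expression is assumed to mirror the default `token_pattern` given in the scikit-learn documentation.

```python
import re
from collections import Counter

corpus = [
    "This is the first sentence.",
    "This one is the second sentence.",
    "Sentence number three.",
    "Is this the first sentence?",
]

def tokenize(text):
    # Hypothetical helper: lowercase the text, then keep runs of two or
    # more word characters (assumed to match CountVectorizer's default).
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

# Vocabulary: every unique token, sorted alphabetically
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

# Document-term matrix: one row per sentence, one column per vocabulary word
matrix = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

The printed vocabulary and rows should line up with the `CountVectorizer` output above, which is a useful sanity check that the matrix really is just word counts.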

In [2]:
import pandas as pd

df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df

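| | first | is | number | one | second | sentence | the | this | three |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |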
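The calls to `X.toarray()` hint that `X` is not an ordinary array. As a small sketch (assuming SciPy is installed alongside scikit-learn), the cell below checks that `X` is a sparse matrix and inspects its size. It also uses `fit_transform`, which combines the separate `fit` and `transform` steps used so far.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first sentence.",
    "This one is the second sentence.",
    "Sentence number three.",
    "Is this the first sentence?",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # fit and transform in one call

print(issparse(X))  # is X stored as a SciPy sparse matrix?
print(X.shape)      # (number of documents, number of vocabulary words)
print(X.nnz)        # number of non-zero entries actually stored
```

Sparse storage keeps only the non-zero counts, which matters once the vocabulary grows to thousands of words and most entries are zero.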


### Adding Preprocessing - Lowercasing and Punctuation Removal

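**Problem**: Raw text mixes capitalization and punctuation. Without normalization, "Sentence" and "sentence" would be treated as different words, and punctuation would clutter the tokens. "This is not ideal for our models." `CountVectorizer` already guards against both by default (`lowercase=True`, plus a default `token_pattern` that skips punctuation), which is why the first vocabulary above was already clean. The next cells make these settings explicit.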

**Solution** - *Lowercasing*

#### Lowercasing

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first Sentence.",
    "This one is the second sentence.",
    "Sentence number three.",
    "Is this the first sentence?",
]

vectorizer = CountVectorizer(lowercase=True) # lowercase=True is the default; written out to make it explicit
vectorizer.fit(corpus)
X = vectorizer.transform(corpus)

print("Vocabulary (Lowercased):")
print(vectorizer.get_feature_names_out())
print("\nDocument-Term Matrix (Lowercased):")
print(X.toarray())

Vocabulary (Lowercased):
['first' 'is' 'number' 'one' 'second' 'sentence' 'the' 'this' 'three']

Document-Term Matrix (Lowercased):
[[1 1 0 0 0 1 1 1 0]
 [0 1 0 1 1 1 1 1 0]
 [0 0 1 0 0 1 0 0 1]
 [1 1 0 0 0 1 1 1 0]]


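**Explanation**: "We passed `lowercase=True` to `CountVectorizer`. See? 'sentence' and 'Sentence' are treated as the same word!" Because `lowercase=True` is the default, the output is identical to the first run. Set `lowercase=False` to see the two spellings split into separate columns.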

#### Removing Punctuation

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first sentence.",
    "This one is the second sentence.",
    "Sentence number three!", # Added punctuation
    "Is this the first sentence?",
]

vectorizer = CountVectorizer(lowercase=True, token_pattern=r'[a-zA-Z]+') # Added token_pattern
vectorizer.fit(corpus)
X = vectorizer.transform(corpus)

print("Vocabulary (No Punctuation, Lowercased):")
print(vectorizer.get_feature_names_out())
print("\nDocument-Term Matrix (No Punctuation, Lowercased):")
print(X.toarray())

Vocabulary (No Punctuation, Lowercased):
['first' 'is' 'number' 'one' 'second' 'sentence' 'the' 'this' 'three']

Document-Term Matrix (No Punctuation, Lowercased):
[[1 1 0 0 0 1 1 1 0]
 [0 1 0 1 1 1 1 1 0]
 [0 0 1 0 0 1 0 0 1]
 [1 1 0 0 0 1 1 1 0]]


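* **Explanation**: "We added `token_pattern=r'[a-zA-Z]+'`. This tells `CountVectorizer` to keep only tokens made of letters, so punctuation is gone!" The default pattern already skips punctuation, which is why the vocabulary is unchanged here. The custom pattern differs in that it also drops digits and keeps single-letter words.
* Explain `token_pattern` briefly as "a rule for what counts as a word." (No need to go deep into regex at this stage, just the practical effect.)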
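To show the practical effect of a "rule for what counts as a word", the sketch below applies two patterns to a made-up sentence using Python's `re.findall`. This only approximates what `CountVectorizer` does internally; the default pattern shown is the one given in the scikit-learn documentation, and the sample text is an assumption chosen to expose the differences.

```python
import re

text = "room 42 isn't a bad test!"

default_pattern = r"(?u)\b\w\w+\b"  # documented scikit-learn default: 2+ word characters
letters_only = r"[a-zA-Z]+"         # the pattern used in this lesson

default_tokens = re.findall(default_pattern, text)
letter_tokens = re.findall(letters_only, text)

print(default_tokens)  # ['room', '42', 'isn', 'bad', 'test']
print(letter_tokens)   # ['room', 'isn', 't', 'a', 'bad', 'test']
```

Both patterns drop the punctuation. The letters-only pattern additionally drops the number `42` but keeps the one-letter tokens `t` and `a`, which the default pattern ignores.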

#### Stop Word Removal - Using Built-in Stop Words in `CountVectorizer`

**Problem**: Point out that common words like "is," "the," "this," and "one" are very frequent but usually carry little information for analysis. These are called 'stop words'.

**Solution** - Stop Word Removal

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first sentence.",
    "This one is the second sentence.",
    "Sentence number three!",
    "Is this the first sentence?",
]

vectorizer = CountVectorizer(
    lowercase=True,
    token_pattern=r'[a-zA-Z]+',
    stop_words='english') # Added stop_words='english'
vectorizer.fit(corpus)
X = vectorizer.transform(corpus)

print("Vocabulary (No Punctuation, Lowercased, Stop Words Removed):")
print(vectorizer.get_feature_names_out())
print("\nDocument-Term Matrix (No Punctuation, Lowercased, Stop Words Removed):")
print(X.toarray())

Vocabulary (No Punctuation, Lowercased, Stop Words Removed):
['number' 'second' 'sentence']

Document-Term Matrix (No Punctuation, Lowercased, Stop Words Removed):
[[0 0 1]
 [0 1 1]
 [1 0 1]
 [0 0 1]]


We added `stop_words='english'`. `CountVectorizer` now automatically removes common English words. Run it and see how the vocabulary shrinks from nine words to three: 'is', 'the', and 'this' are gone, and so are 'first', 'one', and 'three', which are also on the built-in list.

`stop_words='english'` uses a predefined list. We can customize this later if needed.

## Experimentation

**Experiment #1**: Change the corpus to different sentences. What happens to the vocabulary and the matrix?

In [None]:
# your code here

**Experiment #2**: Set `lowercase=False` (simply removing `lowercase=True` changes nothing, because it is the default). What changes?

In [None]:
# your code here

**Experiment #3**: Remove `stop_words='english'`. What comes back in the vocabulary?

In [None]:
# your code here

**Experiment #4**: Try different `token_pattern` values.

In [None]:
# your code here

## Summary
* In this lesson, we learned how to take raw text and use `CountVectorizer` to convert it into numbers that machine learning models can use.

* We covered basic preprocessing steps: lowercasing, punctuation removal, and stop word removal, all using `CountVectorizer`.

* This is just the beginning! In the next lessons, we'll use these numerical representations to build actual text classification models!




Next time, we'll learn about TF-IDF and then start building models like Logistic Regression!

## Key Takeaways

* Text data needs preprocessing to be used in machine learning.
* `CountVectorizer` in scikit-learn is a powerful tool for tokenization and basic preprocessing.
* We can control preprocessing steps like lowercasing, punctuation removal, and stop word removal within `CountVectorizer`.
* The output of `CountVectorizer` is a numerical matrix that represents text data.

## Resources

`Scikit-learn` documentation on `CountVectorizer`: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html