## Bag of Words (BoW)

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are pets."
]

# Initialize CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Convert to array and print
print(X.toarray())
print(vectorizer.get_feature_names_out())


[[0 0 1 0 0 0 0 1 1 0 1 2]
 [0 0 0 0 1 0 1 0 1 0 1 2]
 [1 1 0 1 0 1 0 0 0 1 0 0]]
['and' 'are' 'cat' 'cats' 'dog' 'dogs' 'log' 'mat' 'on' 'pets' 'sat' 'the']
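
To see what `CountVectorizer` is doing, the counts above can be reproduced with the standard library alone. This is only a sketch that assumes the default settings (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary). The helper names `tokenize` and `bag_of_words` are illustrative and not part of scikit-learn.

```python
import re
from collections import Counter

documents = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are pets.",
]

def tokenize(text):
    # Mirror the default behaviour: lowercase, keep tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def bag_of_words(docs):
    # Vocabulary: every distinct token, sorted alphabetically
    vocabulary = sorted({token for doc in docs for token in tokenize(doc)})
    # One row per document, one column per vocabulary term
    matrix = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

vocabulary, matrix = bag_of_words(documents)
print(vocabulary)
for row in matrix:
    print(row)
```

Each column is a vocabulary term and each cell is a raw count, which is why `the` shows up as 2 in the first two rows.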


## TF-IDF

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are pets."
]
# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(documents)

# Convert to array and print
print(X_tfidf.toarray())
print(tfidf_vectorizer.get_feature_names_out())


[[0.         0.         0.42755362 0.         0.         0.
  0.         0.42755362 0.32516555 0.         0.32516555 0.6503311 ]
 [0.         0.         0.         0.         0.42755362 0.
  0.42755362 0.         0.32516555 0.         0.32516555 0.6503311 ]
 [0.4472136  0.4472136  0.         0.4472136  0.         0.4472136
  0.         0.         0.         0.4472136  0.         0.        ]]
['and' 'are' 'cat' 'cats' 'dog' 'dogs' 'log' 'mat' 'on' 'pets' 'sat' 'the']
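
The weights above can be reproduced by hand. The sketch below assumes `TfidfVectorizer`'s default settings: raw term counts, a smoothed inverse document frequency of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The function name `tfidf_matrix` is illustrative only.

```python
import math
import re
from collections import Counter

documents = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are pets.",
]

def tokenize(text):
    # Lowercase and keep tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_matrix(docs):
    tokenized = [tokenize(doc) for doc in docs]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(t for tokens in tokenized for t in set(tokens))
    # Smoothed idf, so a term that appears in every document still gets weight 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocabulary]
        # L2-normalize so every document vector has unit length
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows

vocabulary, rows = tfidf_matrix(documents)
for row in rows:
    print([round(w, 8) for w in row])
```

Compared with the raw counts, `cat` and `mat` (one document each) now outweigh `on` and `sat` (two documents each), even though all four occur once in the first sentence.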


## Word2Vec

In [1]:
!pip install gensim




In [7]:
# Python program to generate word vectors using Word2Vec

# Import the necessary modules
from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize
import warnings

warnings.filterwarnings(action='ignore')


# Read the 'alice.txt' file; the with-block closes it afterwards
with open("C:\\Users\\Admin\\Desktop\\alice.txt") as sample:
    s = sample.read()

# Replace newline characters with spaces
f = s.replace("\n", " ")

# Tokenize each sentence into a list of lowercase words
data = [[word.lower() for word in word_tokenize(sentence)]
        for sentence in sent_tokenize(f)]

# Create CBOW model (sg=0 is the default)
model1 = Word2Vec(data, min_count=1, vector_size=100, window=5)

# Print results
print("Cosine similarity between 'alice' " +
      "and 'wonderland' - CBOW : ",
      model1.wv.similarity('alice', 'wonderland'))

print("Cosine similarity between 'alice' " +
      "and 'machines' - CBOW : ",
      model1.wv.similarity('alice', 'machines'))

# Create Skip Gram model (sg=1)
model2 = Word2Vec(data, min_count=1, vector_size=100, window=5, sg=1)

# Print results
print("Cosine similarity between 'alice' " +
      "and 'wonderland' - Skip Gram : ",
      model2.wv.similarity('alice', 'wonderland'))

print("Cosine similarity between 'alice' " +
      "and 'machines' - Skip Gram : ",
      model2.wv.similarity('alice', 'machines'))


ModuleNotFoundError: No module named 'gensim'

Here's a complete **text classification project** that covers:

* CountVectorization
* TF-IDF Vectorization
* Text Classification using ML (e.g., Logistic Regression, Naive Bayes)
* Text Classification using ANN
* With assignments

---

## 🧠 Project Title: **Sentiment Analysis of Product Reviews**

### 📁 Dataset:

Use the **Amazon Product Review** dataset or any CSV with `['review_text', 'label']`, where `label` is 0 (negative) or 1 (positive). You can use a Kaggle dataset or a mock dataset for teaching.

---

### 📌 PART 1: **Data Preprocessing**

* Load dataset
* Clean text: lowercase, remove punctuation and stopwords, then apply stemming or lemmatization
* Split into train/test
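
The steps above can be sketched with the standard library so that `X_train`, `X_test`, `y_train`, and `y_test` exist for the later parts. Everything here is an assumption for illustration: the mock reviews, the tiny `STOPWORDS` set, and the helpers `clean_text` and `split_train_test`. Stemming or lemmatization is left out, and a real project would more likely load the CSV with pandas and split with scikit-learn's `train_test_split`.

```python
import random
import string

# Hypothetical, deliberately tiny stopword list; a real project would use a fuller one
STOPWORDS = {"a", "an", "and", "is", "it", "the", "this", "was"}

def clean_text(text):
    # Lowercase, strip punctuation, then drop stopwords
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in STOPWORDS)

def split_train_test(texts, labels, test_size=0.25, seed=42):
    # Shuffle indices reproducibly, then hold out the last test_size fraction
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(indices) * test_size))
    train_idx, test_idx = indices[:-n_test], indices[-n_test:]
    return ([texts[i] for i in train_idx], [texts[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])

# Mock reviews standing in for the `review_text` and `label` columns
reviews = [
    "This phone is AMAZING!",
    "Terrible battery, it died in a day.",
    "The camera was great and the screen is sharp.",
    "Stopped working after a week...",
]
labels = [1, 0, 1, 0]

cleaned = [clean_text(review) for review in reviews]
X_train, X_test, y_train, y_test = split_train_test(cleaned, labels)
print(cleaned[0])
print(len(X_train), "training reviews,", len(X_test), "test reviews")
```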

---

### 📌 PART 2: **Text Vectorization**

#### A. CountVectorizer

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test)
```

#### B. TF-IDF Vectorizer

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
```

---

### 📌 PART 3: **Text Classification using ML**

#### A. Logistic Regression (or try Naive Bayes, SVM)

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression()
model.fit(X_train_tfidf, y_train)
y_pred = model.predict(X_test_tfidf)
print("Accuracy:", accuracy_score(y_test, y_pred))
```
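
Accuracy alone can look good when one label dominates the dataset, so it may help to check precision and recall for the positive class as well. The sketch below computes them by hand. The label lists are made up for illustration and stand in for `y_test` and `y_pred`, and `binary_metrics` is a hypothetical helper. In practice `sklearn.metrics.classification_report` reports the same quantities.

```python
def binary_metrics(y_true, y_pred):
    # Count true positives, false positives, and false negatives for label 1
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels standing in for y_test and y_pred
y_true_demo = [1, 0, 1, 1, 0, 1]
y_pred_demo = [1, 0, 0, 1, 1, 1]

precision, recall, f1 = binary_metrics(y_true_demo, y_pred_demo)
print("Precision:", precision, "Recall:", recall, "F1:", f1)
```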

---

### 📌 PART 4: **Text Classification using ANN**

```python
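import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

# Feed-forward network on the TF-IDF features; the sigmoid output is P(label = 1)
ann_model = Sequential()
ann_model.add(Input(shape=(X_train_tfidf.shape[1],)))
ann_model.add(Dense(64, activation='relu'))
ann_model.add(Dense(32, activation='relu'))
ann_model.add(Dense(1, activation='sigmoid'))

ann_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Dense layers need dense input, so convert the sparse TF-IDF matrix with .toarray()
ann_model.fit(X_train_tfidf.toarray(), np.asarray(y_train), epochs=5, batch_size=32, validation_split=0.2)

# Evaluate on the held-out test set
loss, accuracy = ann_model.evaluate(X_test_tfidf.toarray(), np.asarray(y_test))
print("Test accuracy:", accuracy)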
```

---

Yes, absolutely! You **can train your own Word2Vec model** from scratch using your **own dataset** — such as customer reviews, product descriptions, or any custom text corpus.

---

### ✅ When Should You Train Word2Vec on Your Own Data?

* You have **domain-specific vocabulary** (e.g., legal, medical, retail).
* Pre-trained embeddings (like Google News vectors) **miss important context** in your data.
* You want **embeddings tailored** to your use case (e.g., customer sentiments, product names, slang).

---

### 🛠️ How to Train Word2Vec on Your Own Data (Using `gensim`)

#### 1. **Install Required Library**

```bash
pip install gensim
```

#### 2. **Prepare Your Text Corpus**

Make sure your text is tokenized into sentences and words.

```python
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize, sent_tokenize
import nltk
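nltk.download('punkt')      # tokenizer data used by sent_tokenize and word_tokenize
nltk.download('punkt_tab')  # recent NLTK releases look for this resource instead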

# Example: simple dataset
text = """I love this phone. The camera quality is amazing. Battery lasts all day."""
sentences = sent_tokenize(text)
tokenized_data = [word_tokenize(sent.lower()) for sent in sentences]
```

#### 3. **Train Word2Vec Model**

```python
model = Word2Vec(sentences=tokenized_data, vector_size=100, window=5, min_count=1, workers=4)
```

#### 4. **Use the Model**

```python
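# Get the 100-dimensional vector for the word 'phone'
vector = model.wv['phone']

# Find the most similar words: a list of (word, cosine similarity) tuples.
# On a corpus this small the rankings are not meaningful.
similar = model.wv.most_similar('camera')
print(similar)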
```
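
`most_similar` ranks words by the cosine similarity between their vectors. As a minimal sketch of that measure, the function below computes it for plain Python lists. The three-dimensional vectors are invented for illustration and are not output from a trained model.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented toy vectors; a trained model would supply model.wv['phone'] and so on
phone = [0.9, 0.1, 0.3]
camera = [0.8, 0.2, 0.4]
banana = [-0.2, 0.9, -0.5]

print("phone vs camera:", cosine_similarity(phone, camera))
print("phone vs banana:", cosine_similarity(phone, banana))
```

Values close to 1 mean the vectors point in nearly the same direction, and values near 0 or below mean the words appear in unrelated contexts.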

#### 5. **Save & Load Model**

```python
model.save("custom_word2vec.model")
# Later: model = Word2Vec.load("custom_word2vec.model")
```

---

### ⚙️ Key Parameters

| Parameter     | Description                                   |
| ------------- | --------------------------------------------- |
| `vector_size` | Dimensionality of word vectors                |
| `window`      | Context window size                           |
| `min_count`   | Ignores words with total frequency below this |
| `workers`     | Worker threads used for training              |
| `sg`          | Algorithm: 0 = CBOW (default), 1 = skip-gram  |

---

### 📚 Example Use Cases:

* Build a recommendation system using similarity between products
* Cluster similar reviews
* Visualize word relationships using t-SNE
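
For the review-clustering and recommendation ideas, one common starting point is to represent each review as the average of its word vectors and then compare reviews by similarity. The sketch below assumes this mean-pooling approach. The two-dimensional `embeddings` are invented stand-ins for `model.wv`, and `review_vector` and `similarity` are hypothetical helpers.

```python
import math

# Invented 2-d embeddings; a trained model would supply model.wv[word] instead
embeddings = {
    "great": [0.9, 0.1],
    "camera": [0.7, 0.3],
    "love": [0.8, 0.2],
    "phone": [0.6, 0.4],
    "slow": [-0.7, 0.5],
    "delivery": [-0.5, 0.8],
}

def review_vector(tokens):
    # Average the vectors of known tokens, then scale to unit length
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    mean = [sum(dim) / len(vectors) for dim in zip(*vectors)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm else mean

def similarity(a, b):
    # For unit-length vectors the dot product equals the cosine similarity
    return sum(x * y for x, y in zip(a, b))

r1 = review_vector(["great", "camera"])
r2 = review_vector(["love", "phone"])
r3 = review_vector(["slow", "delivery"])

print("r1 vs r2:", similarity(r1, r2))
print("r1 vs r3:", similarity(r1, r3))
```

The resulting review vectors could then be passed to a clustering algorithm or projected with t-SNE for plotting.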

---

Would you like a complete Jupyter Notebook or a sample dataset to test this on?
