### What is Feature Extraction?

- Feature extraction is the process of transforming raw data into numerical features that can be used for machine learning. This is an important step in machine learning because it allows us to use algorithms that can only operate on numerical data.

### Why do we need it?

Feature extraction helps by:

- Improving the accuracy and efficiency of machine learning models.
- Reducing computational costs and training time.
- Enhancing our understanding of the data.
- Enabling the use of machine learning algorithms on various types of data.

### Why is it so difficult?

To extract good features, you need to:

- Understand the data and the task deeply.
- Balance information retention with dimensionality reduction.
- Deal with noise, redundancy, and high dimensionality.
- Navigate the trade-off between model performance and interpretability.

## One-Hot Encoding

 - Take a list of fruits: "apple," "banana," "orange." One-hot encoding turns each fruit into a unique vector with a single 1 and 0s everywhere else.

 - Example:

    apple: [1, 0, 0]

    banana: [0, 1, 0]
    
    orange: [0, 0, 1]


 - Advantages:
    1. Simple: Easy to understand and implement.
    2. No assumptions: Doesn't assume any relationship between the items.

 - Disadvantages:

    1. Huge vectors: If you have many items (like all the words in a language), the vectors become very long, wasting space.
    2. No meaning: Doesn't capture any meaning or relationship between the items (e.g., it doesn't know that apples and oranges are both fruits).
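
The idea can be sketched from scratch in a few lines of plain Python. This is only an illustration, not how scikit-learn implements it; the helper name `one_hot` and the choice to map unseen items to an all-zero vector are assumptions made for this sketch.

```python
def one_hot(item, vocabulary):
    """Return a one-hot list for `item` (hypothetical helper, for illustration)."""
    vector = [0] * len(vocabulary)
    if item in vocabulary:               # unseen items stay all-zero (an assumption)
        vector[vocabulary.index(item)] = 1
    return vector

vocabulary = ["apple", "banana", "orange"]
for fruit in vocabulary:
    print(fruit, one_hot(fruit, vocabulary))
```

Each vector is as long as the vocabulary, which is why the representation grows quickly when there are many distinct items.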

In [None]:
from sklearn.preprocessing import LabelBinarizer

In [None]:
fruits = ["apple", "banana", "orange", "apple"]

In [None]:
lb = LabelBinarizer()  # Create a LabelBinarizer object
encoded_fruits = lb.fit_transform(fruits) # Fit and transform the data

In [None]:
print(encoded_fruits)  # Print the one-hot encoded array
print(lb.classes_) # Print the classes (useful to know what each column represents)

[[1 0 0]
 [0 1 0]
 [0 0 1]
 [1 0 0]]
['apple' 'banana' 'orange']


In [None]:
# Example with new data: "grape" was not seen during fit
new_fruits = ["banana", "grape", "apple"]
encoded_new_fruits = lb.transform(new_fruits) # Transform new data using the fitted LabelBinarizer
print(encoded_new_fruits)  # The unseen "grape" becomes an all-zero row

[[0 1 0]
 [0 0 0]
 [1 0 0]]


## Bag-of-Words

- Take two sentences: "The cat sat on the mat" and "The dog sat on the mat." Bag-of-words counts how many times each vocabulary word appears in each sentence.

- Example:

    Vocabulary: "the," "cat," "sat," "on," "mat," "dog"

    "The cat sat on the mat": [2, 1, 1, 1, 1, 0]
    
    "The dog sat on the mat": [2, 0, 1, 1, 1, 1]


- Advantages:

    1. Captures frequency: Shows how often words appear, which can be useful.
    2. Fixed size vectors: Even if sentences have different lengths, the vectors always have the same size.

- Disadvantages:

    1. Ignores order: Doesn't care about the order of words, so "the cat sat" is the same as "cat the sat."
    2. Still large vectors: With a large vocabulary, vectors can still be quite long.
    3. No meaning: Like one-hot encoding, it doesn't understand the meaning of words.
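
The counting described above can be sketched in plain Python. The helper names `tokenize` and `bag_of_words` are made up for this illustration, and the vocabulary is built in order of first appearance to match the example vectors; a library vectorizer may order and tokenize differently.

```python
from collections import Counter

def tokenize(sentence):
    """Lowercase and split on whitespace, dropping surrounding punctuation."""
    return [word.strip(".,!?\"'").lower() for word in sentence.split()]

def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(tokenize(sentence))
    return [counts[word] for word in vocabulary]

sentences = ["The cat sat on the mat", "The dog sat on the mat."]

# Build the vocabulary in order of first appearance
vocabulary = []
for sentence in sentences:
    for word in tokenize(sentence):
        if word not in vocabulary:
            vocabulary.append(word)

vectors = [bag_of_words(s, vocabulary) for s in sentences]
print(vocabulary)
print(vectors)
```

Because only counts are kept, `bag_of_words("the cat sat", vocabulary)` and `bag_of_words("cat the sat", vocabulary)` give the same vector.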

In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.DataFrame({"text":["people watch sneh",
                         "sneh watch sneh",
                         "people write comment",
                          "sneh write comment"],"output":[1,1,0,0]})

In [None]:
df

Unnamed: 0,text,output
0,people watch sneh,1
1,sneh watch sneh,1
2,people write comment,0
3,sneh write comment,0


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [None]:
bagOfWords = cv.fit_transform(df["text"])

In [None]:
print(cv.vocabulary_)

{'people': 1, 'watch': 3, 'sneh': 2, 'write': 4, 'comment': 0}


In [None]:
bagOfWords.toarray()

array([[0, 1, 1, 1, 0],
       [0, 0, 2, 1, 0],
       [1, 1, 0, 0, 1],
       [1, 0, 1, 0, 1]])

In [None]:
print(bagOfWords[0].toarray())
print(bagOfWords[1].toarray())
print(bagOfWords[2].toarray())

[[0 1 1 1 0]]
[[0 0 2 1 0]]
[[1 1 0 0 1]]


In [None]:
cv.transform(["sneh write comment"]).toarray()

array([[1, 0, 1, 0, 1]])

In [None]:
X = bagOfWords.toarray()
y = df['output']

In [None]:
X

array([[0, 1, 1, 1, 0],
       [0, 0, 2, 1, 0],
       [1, 1, 0, 0, 1],
       [1, 0, 1, 0, 1]])

In [None]:
y

Unnamed: 0,output
0,1
1,1
2,0
3,0


### N-Grams

In [None]:
df = pd.DataFrame({"text":["people watch sneh",
                         "sneh watch sneh",
                         "people write comment",
                          "sneh write comment"],"output":[1,1,0,0]})

In [None]:
df

Unnamed: 0,text,output
0,people watch sneh,1
1,sneh watch sneh,1
2,people write comment,0
3,sneh write comment,0


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(2,2))

In [None]:
bow = cv.fit_transform(df['text'])

In [None]:
print(cv.vocabulary_)

{'people watch': 0, 'watch sneh': 4, 'sneh watch': 2, 'people write': 1, 'write comment': 5, 'sneh write': 3}


In [None]:
print(bow[0].toarray())
print(bow[1].toarray())
print(bow[2].toarray())

[[1 0 0 0 1 0]]
[[0 0 1 0 1 0]]
[[0 1 0 0 0 1]]


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(3,3))

In [None]:
bow = cv.fit_transform(df['text'])

In [None]:
print(cv.vocabulary_)

{'people watch sneh': 0, 'sneh watch sneh': 2, 'people write comment': 1, 'sneh write comment': 3}


In [None]:
print(bow[0].toarray())
print(bow[1].toarray())
print(bow[2].toarray())

[[1 0 0 0]]
[[0 0 1 0]]
[[0 1 0 0]]
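
The `ngram_range` setting used in the cells above makes the vectorizer count contiguous runs of n words instead of single words. A minimal sketch of how such word n-grams can be generated, assuming simple whitespace tokenization (the helper name `ngrams` is hypothetical):

```python
def ngrams(text, n):
    """Return the contiguous word n-grams of `text` as space-joined strings."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("people watch sneh", 2))
print(ngrams("people watch sneh", 3))
```

A text shorter than n words yields no n-grams at all, so larger values of n produce sparser features.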


## TF-IDF (Term Frequency-Inverse Document Frequency)

- It's a statistical measure used to evaluate how important a word is to a document within a collection of documents (a corpus).  

### Why is TF-IDF useful?

- Information Retrieval:  TF-IDF is widely used in search and retrieval systems.  Given a query, documents can be ranked by the TF-IDF weights of the query terms they contain.  Terms that are frequent in a document but rare across the corpus receive high TF-IDF scores, so documents containing them rank higher.

- Text Mining:  TF-IDF can be used to extract keywords from documents.  Words with high TF-IDF scores are likely to be the most important words in the document.

- Document Classification:  TF-IDF can be used to represent documents as vectors of TF-IDF scores.  These vectors can then be used to train machine learning models for document classification.

In [None]:
df = pd.DataFrame({"text":["people watch sneh",
                         "sneh watch sneh",
                         "people write comment",
                          "sneh write comment"],"output":[1,1,0,0]})

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfid = TfidfVectorizer()

In [None]:
arr = tfid.fit_transform(df['text']).toarray()

In [None]:
arr

array([[0.        , 0.61366674, 0.49681612, 0.61366674, 0.        ],
       [0.        , 0.        , 0.8508161 , 0.52546357, 0.        ],
       [0.57735027, 0.57735027, 0.        , 0.        , 0.57735027],
       [0.61366674, 0.        , 0.49681612, 0.        , 0.61366674]])

In [None]:
print(tfid.idf_)

[1.51082562 1.51082562 1.22314355 1.51082562 1.51082562]
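
The numbers above can be reproduced by hand. The sketch below assumes scikit-learn's default settings: raw counts as term frequency, the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2 normalization of each row. The variable and function names are made up for this illustration.

```python
import math

docs = ["people watch sneh", "sneh watch sneh",
        "people write comment", "sneh write comment"]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for doc in tokenized for word in doc})
n_docs = len(docs)

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1, where df = number of documents containing the word
idf = {}
for word in vocabulary:
    df_count = sum(word in doc for doc in tokenized)
    idf[word] = math.log((1 + n_docs) / (1 + df_count)) + 1

def tfidf_row(tokens):
    """Raw term counts times IDF, scaled to unit Euclidean length."""
    weights = [tokens.count(word) * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

print([round(idf[word], 8) for word in vocabulary])
print([round(v, 8) for v in tfidf_row(tokenized[0])])
```

"sneh" appears in three of the four documents, so it gets the lowest IDF; the other words each appear in two documents and share the same, higher value.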


## Word2Vec

- Word2Vec is a popular technique for learning word embeddings.  Word embeddings are a way to represent words as dense, continuous vectors in a high-dimensional space.  The key idea behind Word2Vec is that words that appear in similar contexts should have similar vector representations.  This allows Word2Vec to capture semantic relationships between words.


- Two Main Architectures:

    1. Continuous Bag-of-Words (CBOW):  CBOW tries to predict a target word based on the words in its context.  For example, if the context is "the cat sat on the," CBOW tries to predict the word "mat."

    2. Skip-gram: Skip-gram does the opposite.  It tries to predict the context words given a target word.  For example, given the word "mat," Skip-gram tries to predict the words "the," "cat," "sat," "on," and "the."
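
The difference between the two architectures shows up in the training examples they are fed. The sketch below only builds those examples and does not train anything; the function names and the fixed symmetric window are assumptions, and real implementations add details such as subsampling and negative sampling that are omitted here.

```python
def context_window(tokens, index, window):
    """Words within `window` positions of tokens[index], excluding the target itself."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

def cbow_examples(tokens, window):
    """(context words, target) pairs: the context is used to predict the target."""
    return [(context_window(tokens, i, window), target)
            for i, target in enumerate(tokens)]

def skipgram_examples(tokens, window):
    """(target, context word) pairs: the target is used to predict each context word."""
    return [(target, context)
            for i, target in enumerate(tokens)
            for context in context_window(tokens, i, window)]

tokens = "the cat sat on the mat".split()
print(cbow_examples(tokens, window=5)[-1])
print([pair for pair in skipgram_examples(tokens, window=5) if pair[0] == "mat"])
```

CBOW produces one example per target word, while Skip-gram produces one example per (target, context word) pair.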



- Benefits of Word2Vec:

    1. Captures Semantic Relationships: Word2Vec can capture semantic relationships between words, such as synonyms, antonyms, and analogies.  For example, the vectors for "king" and "queen" will be closer together than the vectors for "king" and "car."

    2. Dense Vectors: Word embeddings are dense, meaning that most of their values are non-zero, and they are much shorter than one-hot vectors (a few hundred dimensions at most, instead of one dimension per vocabulary word).  This makes them more efficient to store and process than one-hot vectors.

    3. Widely Used: Word2Vec is a widely used technique in natural language processing.  It is often used as a first step in other NLP tasks, such as text classification, sentiment analysis, and machine translation.

- Limitations of Word2Vec:

    1. Context Window: Word2Vec has a limited context window.  It only considers the words that fall within a fixed distance of the target word.  This means that it may not be able to capture long-range dependencies between words.

    2. Out-of-Vocabulary Words: Word2Vec cannot handle out-of-vocabulary words.  If a word is not in the training corpus (or is filtered out for being too rare), Word2Vec will not be able to create a vector for it.

    3. Ignores Word Order Within the Window: The CBOW model averages the vectors of the context words, thus losing information about word order.  Skip-gram predicts each context word separately, but it likewise does not encode the exact position of a word inside the window.
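
In practice, the out-of-vocabulary limitation means a lookup should be guarded before use. A minimal sketch, where the helper `lookup_vector` is hypothetical and a plain dict stands in for trained vectors; gensim's `KeyedVectors` object (`model.wv`) is expected to support the same `in` test and `[]` lookup, but that is an assumption to check against the installed version.

```python
def lookup_vector(vectors, word, default=None):
    """Return the vector for `word`, or `default` if the word is out of vocabulary."""
    if word in vectors:
        return vectors[word]
    return default

# A plain dict stands in for trained word vectors here (values are made up)
toy_vectors = {"cat": [0.1, 0.3], "mat": [0.2, 0.1]}

print(lookup_vector(toy_vectors, "cat"))
print(lookup_vector(toy_vectors, "unicorn"))
```

Returning a default instead of raising lets the caller decide whether to skip the word, substitute a zero vector, or fall back to another strategy.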

In [45]:
# https://drive.google.com/file/d/1lbtAwzE7l0otXYFDtGUKKWzI83bD5D5H/view?usp=drive_link

In [46]:
import numpy as np
import pandas as pd
import gensim
import os

In [49]:
from nltk import sent_tokenize
from gensim.utils import simple_preprocess
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [50]:
story = []
for filename in os.listdir('data'):   # 'data' folder containing the text file(s)
    if filename == '.ipynb_checkpoints':
        continue   # skip Jupyter's checkpoint directory
    with open(os.path.join('data', filename)) as f:
        corpus = f.read()
    raw_sent = sent_tokenize(corpus)
    for sent in raw_sent:
        story.append(simple_preprocess(sent))   # lowercase and tokenize each sentence

In [51]:
story

[['game',
  'of',
  'thrones',
  'book',
  'one',
  'of',
  'song',
  'of',
  'ice',
  'and',
  'fire',
  'by',
  'george',
  'martin',
  'prologue',
  'we',
  'should',
  'start',
  'back',
  'gared',
  'urged',
  'as',
  'the',
  'woods',
  'began',
  'to',
  'grow',
  'dark',
  'around',
  'them'],
 ['the', 'wildlings', 'are', 'dead'],
 ['do', 'the', 'dead', 'frighten', 'you'],
 ['ser',
  'waymar',
  'royce',
  'asked',
  'with',
  'just',
  'the',
  'hint',
  'of',
  'smile'],
 ['gared', 'did', 'not', 'rise', 'to', 'the', 'bait'],
 ['he',
  'was',
  'an',
  'old',
  'man',
  'past',
  'fifty',
  'and',
  'he',
  'had',
  'seen',
  'the',
  'lordlings',
  'come',
  'and',
  'go'],
 ['dead', 'is', 'dead', 'he', 'said'],
 ['we', 'have', 'no', 'business', 'with', 'the', 'dead'],
 ['are', 'they', 'dead'],
 ['royce', 'asked', 'softly'],
 ['what', 'proof', 'have', 'we'],
 ['will', 'saw', 'them', 'gared', 'said'],
 ['if',
  'he',
  'says',
  'they',
  'are',
  'dead',
  'that',
  'proof',


In [74]:
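model = gensim.models.Word2Vec(
    window=10,     # maximum distance between the target word and a context word
    min_count=2    # ignore words that occur fewer than 2 times
)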

In [54]:
model.build_vocab(story)

In [55]:
model.train(story, total_examples=model.corpus_count, epochs=model.epochs)

(322670, 447775)

In [57]:
model.wv.most_similar("daenerys")

[('still', 0.9991561770439148),
 ('dothraki', 0.9990547299385071),
 ('off', 0.9990369081497192),
 ('an', 0.9990357756614685),
 ('all', 0.9990357160568237),
 ('three', 0.9990352988243103),
 ('bran', 0.9990295767784119),
 ('while', 0.9990199208259583),
 ('dany', 0.9990194439888),
 ('two', 0.9990050792694092)]
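
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that computation, using made-up 3-dimensional vectors rather than vectors from the trained model:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over product of lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors, for illustration only
king = [0.9, 0.8, 0.1]
queen = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

print(cosine_similarity(king, queen))
print(cosine_similarity(king, car))
```

Scores that are all very close to 1, as in the output above, may indicate that the vectors have not separated much yet, which can happen with a small corpus or few training epochs.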

In [62]:
model.wv['deep'].shape

(100,)

In [63]:
vec = model.wv.get_normed_vectors()

In [64]:
vec

array([[-0.13934398,  0.07407901,  0.08952442, ..., -0.08813339,
         0.08709886,  0.02906539],
       [-0.13297635,  0.07225876,  0.08452213, ..., -0.08704549,
         0.08420957,  0.01978605],
       [-0.09532618,  0.05333837,  0.06874396, ..., -0.08400211,
         0.08501223, -0.01437212],
       ...,
       [-0.08854766,  0.10782592,  0.09496778, ..., -0.10206794,
         0.08771088,  0.02310876],
       [-0.08634775,  0.07934137,  0.04834182, ..., -0.10746276,
         0.06899563, -0.02590153],
       [-0.01618071,  0.1257968 , -0.01316106, ..., -0.07061569,
         0.09569708, -0.02096259]], dtype=float32)

In [65]:
from sklearn.decomposition import PCA

In [66]:
pca = PCA(n_components=3)

In [67]:
X = pca.fit_transform(model.wv.get_normed_vectors())

In [68]:
X

array([[-0.01315868, -0.20178273,  0.01060704],
       [-0.02425832, -0.16169518,  0.00648315],
       [-0.05171812,  0.03014968, -0.00579617],
       ...,
       [-0.00399983,  0.02738353,  0.02748627],
       [-0.02966475, -0.03668648,  0.0089586 ],
       [ 0.05956769,  0.08699934, -0.11506955]], dtype=float32)

In [69]:
X.shape

(3840, 3)

In [72]:
import plotly.express as px

In [73]:
# Plot rows 200-299 in the three PCA dimensions; the color only encodes the row index
fig = px.scatter_3d(X[200:300], x=0, y=1, z=2, color=list(range(len(X[200:300]))))
fig.show()