# NLP

## Fundamentals

NLP, or Natural Language Processing, is a subfield of Artificial Intelligence that gives machines the ability to **understand and extract meaning from human languages**.

![image.png](attachment:0001a3df-52b4-4011-877c-727e5e76c35d.png)

NLP sits at the intersection of data science and human language. It allows data scientists to derive meaningful results in areas such as media, healthcare, finance, and human resources. NLP is currently booming thanks to huge improvements in data acquisition and a sharp increase in computational power.

Natural Language Processing can also be described as a field of computer science that deals with communication between computer systems and humans. Its techniques are used in Artificial Intelligence and Machine Learning to build automated software that understands human language, whether written or spoken, and extracts useful information from it. In short, NLP techniques allow computer systems to process and interpret data in the form of natural language.

NLP can help people with many tasks. Some examples are given below:

- **Diagnosing**: Prediction of diseases based on the patient’s own speech and electronic health records.

- **Sentiment Analysis**: Organizations can determine what customers are feeling about a product or service by extracting information from sources like social media.

- **Translator**: Online translators became far more successful once NLP was applied to the field.

- **ChatBot**: To communicate with the customer like an actual employee.

- **Classifying emails**: To classify emails as spam or ham and stop spam before it even enters the inbox.

- **Detecting Fake News**: To determine if a source is politically biased or accurate, detecting if a news source can be trusted or not.

- **Intelligent Voice-Driven Interfaces**: Apple’s Siri and the Iris app for Android are examples of intelligent voice-driven interfaces that use NLP to respond to humans.

- **Trading Algorithms**: Tracking news, reports, and comments about finance to buy or sell stocks automatically.

- **Recruiting Assistant**: Supporting both the search and selection phases of hiring, and identifying the skills of potential hires.

- **Litigation Tasks**: To automate routine litigation tasks and help courts save time.

Two real-life applications of Natural Language Processing are:

Google Translate: Google Translate is one of the best-known applications of Natural Language Processing. It converts written or spoken sentences into many languages, and it can also be used to look up the pronunciation and meaning of a word. It relies on advanced NLP techniques to translate sentences between languages.

Chatbots: To provide better customer support, companies have started using chatbots for 24/7 service. Chatbots help resolve customers' basic queries. If a chatbot cannot resolve a query, it forwards the query to the support team while still engaging the customer, which helps customers feel that the support team is attending to them quickly. With the help of chatbots, companies can build cordial relations with customers. This is made possible by Natural Language Processing.

Apart from sentiment analysis and text classification, all of these models are advanced topics and are built with transfer learning, because the scores come out poor when we build them manually. It is therefore more effective to learn how to do these two tasks manually and to solve the other tasks with existing pretrained models.

## NLP Theory

Main concepts:

- Document and Corpus,
- Vectorization,
- Bag of Words,
- TF-IDF.

### Document vs Corpus:

A corpus is a collection of texts; Wikipedia describes it as "a large and structured set of texts". A document is a much narrower unit: a single text within the corpus. For example, suppose you are implementing an NLP task using a book, and each paragraph of the book is a row in your dataset. The book itself is the corpus and each paragraph is a document. Likewise, in a dataset of SMS messages, each message is a document and the whole dataset forms the corpus. The plural of corpus is **corpora**.
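The book example can be sketched in a few lines of Python. The text in `book_text` is a made-up stand-in for a real book, and splitting on blank lines is only one simple assumption about how paragraphs are separated:

```python
# Hypothetical stand-in for the text of a book: paragraphs separated by blank lines.
book_text = """NLP lets machines work with human language.

A corpus is a collection of texts.

Each text in the collection is a document."""

# The corpus is the whole collection; each paragraph is one document (one row).
corpus = [p.strip() for p in book_text.split("\n\n") if p.strip()]

for doc_id, document in enumerate(corpus):
    print(doc_id, document)

print("documents in corpus:", len(corpus))
```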

### Vectorization:

Word vectorization is a methodology in NLP to map words from vocabulary to a corresponding numeric vector. In other words, it is the process of converting words into numbers.


There are several methods to implement vectorization. The best-known ones are listed below:
- Count Vectors (CountVectorizer)
- TF-IDF Vectors (TfidfVectorizer)
- Word Embeddings (Word2Vec, GloVe, etc.)

We will focus on the CountVectorizer and TF-IDF methods. CountVectorizer is perhaps the simplest way to vectorize text: it counts how many times each word occurs in each document. Consider the following two documents:

Document-1: John likes to watch movies. Mary likes movies too

Document-2: Mary also likes to watch football games

CountVectorizer converts a collection of text documents to a matrix of token counts. We can picture this as a 2-dimensional matrix in which one dimension is the entire vocabulary (one row per word) and the other is the documents (here, one column per document).

![image.png](attachment:278ea57d-4ae3-452c-9969-bbc04ad0ee83.png)

Vectors of each document can be demonstrated as follows:

- vector-1 (Document-1): [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
- vector-2 (Document-2): [0, 1, 1, 1, 0, 1, 0, 1, 1, 1]

Its JSON object representation is as follows. This method is called **Bag of Words (BoW)** modeling:

- BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1}
- BoW2 = {"Mary":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1}
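The counting above can be reproduced with the Python standard library. This is a minimal sketch, not scikit-learn's `CountVectorizer`: the `tokenize` helper and the first-seen vocabulary order are assumptions chosen so that the output lines up with the vectors shown above (a library vectorizer would typically lowercase the text and order the vocabulary differently).

```python
import re
from collections import Counter

documents = [
    "John likes to watch movies. Mary likes movies too",
    "Mary also likes to watch football games",
]

def tokenize(text):
    # Keep runs of word characters; punctuation such as "." is dropped.
    return re.findall(r"\w+", text)

# One bag of words (token -> count) per document.
bags = [Counter(tokenize(doc)) for doc in documents]

# Vocabulary in first-seen order, so the positions match the vectors above.
vocabulary = []
for doc in documents:
    for token in tokenize(doc):
        if token not in vocabulary:
            vocabulary.append(token)

# Count vector: one entry per vocabulary word; Counter returns 0 for missing words.
vectors = [[bag[word] for word in vocabulary] for bag in bags]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each `Counter` plays the role of the BoW dictionary, and each list in `vectors` is the same information laid out against the shared vocabulary.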

### TF-IDF

The TF-IDF method is more advanced than simply counting words. TF-IDF stands for **term frequency-inverse document frequency**. The **tf-idf weight is a statistical measure used to evaluate how important a word is to a document in a corpus**. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

**What is TF (Term Frequency)?**

TF measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in a long document than in a short one. Thus, the term frequency is divided by the document length as a way of normalization:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
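As a rough illustration of the TF formula, the sketch below computes the normalized frequency of every term in one already-tokenized document. The function name `term_frequency` and the sample sentence are illustrative choices, not part of any library:

```python
from collections import Counter

def term_frequency(tokens):
    """Return {term: count / total number of terms} for one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

tokens = "mary also likes to watch football games".split()
tf = term_frequency(tokens)
print(tf["likes"])  # one occurrence out of seven terms
```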

**What is IDF (Inverse Document Frequency)?**

IDF measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms such as "are", "a", "the", "is", "of", and "that", which are called **stop words**, may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

IDF(t) = log(Total number of documents in the corpus / Number of documents with term t in it)

The base of the logarithm is a convention; the example below uses base 10.

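The ratio is inverted so that the logarithm comes out non-negative: the number of documents containing a term can never exceed the total number of documents, so the un-inverted ratio is at most 1 and its log would be negative or zero. Taking the log applies a natural, logarithmic scaling to the ratio.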
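The IDF formula can be sketched as follows, assuming a base-10 logarithm and a tiny made-up corpus of three tokenized documents. The helper name `inverse_document_frequency` is hypothetical; library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed variant by default, so their numbers will differ from this plain version.

```python
import math

def inverse_document_frequency(corpus_tokens):
    """Return {term: log10(N / df)}, where df is the number of documents containing the term."""
    n_documents = len(corpus_tokens)
    document_frequency = {}
    for tokens in corpus_tokens:
        for term in set(tokens):  # count each term at most once per document
            document_frequency[term] = document_frequency.get(term, 0) + 1
    return {term: math.log10(n_documents / df) for term, df in document_frequency.items()}

corpus_tokens = [
    "the cow eats grass".split(),
    "the dog eats meat".split(),
    "the cat sleeps".split(),
]
idf = inverse_document_frequency(corpus_tokens)
print(idf["the"])   # appears in every document, so the weight is zero
print(idf["eats"])  # appears in two of three documents
print(idf["cow"])   # appears in one document, so it gets the largest weight
```

A term that occurs in every document gets an IDF of zero, which is how stop words end up weighed down.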

**Example:**

Consider a document containing 100 words wherein the word cow appears 3 times.

The TF for cow is then (3 / 100) = 0.03. Now, let's assume the corpus has 10 million documents and the word cow appears in one thousand of these. Then, using a base-10 logarithm, the IDF is log(10,000,000 / 1,000) = 4. The TF-IDF weight is the product of these quantities: 0.03 * 4 = 0.12.
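The arithmetic of this example can be checked directly. The variable names are illustrative, and `math.log10` is used because the result of 4 implies a base-10 logarithm:

```python
import math

term_count = 3            # occurrences of "cow" in the document
document_length = 100     # total words in the document
n_documents = 10_000_000  # documents in the corpus
n_containing = 1_000      # documents that contain "cow"

tf = term_count / document_length
idf = math.log10(n_documents / n_containing)  # base-10 logarithm, as the example implies
tf_idf = tf * idf

print(tf, idf, round(tf_idf, 2))
```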

The aim of TF-IDF is to return a single coefficient that reflects how important a word is to both the document and the corpus. The larger the value, the more important the word is for our data.


## Data Preparation With NLP

**Text pre-processing simply means bringing a text into a form that is analyzable and predictable for a task.** The steps are:

- Remove punctuation,
- Remove stopwords,
- Tokenize the text,
- Vectorize the words.

Cleaning Steps (mostly for classification and sentiment analysis):

1. Clean the text data. For classification and sentiment analysis, all letters are lowercased; in advanced models, some capitalized words that refer to a specific object, place, or person are kept as they are.
2. Do tokenization.
3. Remove stop words.
4. Remove punctuation. Punctuation is removed only for classification and sentiment analysis; advanced models such as GPT and BERT keep it. It is removed for sentiment analysis and classification because those tasks lean on keywords rather than punctuation.
5. Do stemming. Reducing words to their roots is used only with ML models, because ML models cannot capture semantic relationships between words. It is not used with DL models, because DL models can learn the relationships between words, i.e. which words can be used together and which word is likely to follow another.
6. Do lemmatization. When stripping suffixes to reach the root of a word, lemmatization also checks whether meaning is lost, so it is more effective than stemming and is generally preferred. For example, stemming turns "United" into "Unite", whereas lemmatization keeps it as "United". It is best to try both and continue with whichever gives the better result. Stemming and lemmatization are used in ML, not in DL.
7. Do word normalization.

![image.png](attachment:a54f2b6e-9345-45be-af6f-c4342827efac.png)
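The first four cleaning steps can be sketched as one small function. This is only an illustration: the `STOPWORDS` set is a tiny hand-picked list (an assumption, not a standard resource), only ASCII punctuation is handled, and stemming and lemmatization are left out because they normally rely on a library such as NLTK.

```python
import re
import string

# Tiny illustrative stopword list (an assumption); real projects use a fuller list.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "it", "this"}
PUNCTUATION = set(string.punctuation)  # ASCII punctuation only

def preprocess(text):
    """Lowercase, tokenize, then drop stopwords and punctuation tokens."""
    text = text.lower()                                  # 1. clean: normalize case
    tokens = re.findall(r"\w+|[^\w\s]", text)            # 2. tokenize into words and punctuation marks
    tokens = [t for t in tokens if t not in STOPWORDS]   # 3. remove stop words
    tokens = [t for t in tokens if t not in PUNCTUATION] # 4. remove punctuation
    return tokens

print(preprocess("The movie is great, and the acting is superb!"))
```

The resulting token list is what would then be passed on to a vectorizer.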

### Removing Punctuation:

Text may contain punctuation such as commas, quotes, apostrophes, question marks, and more. We cannot feed raw text to a machine learning model; we need to clean it first. Removing punctuation is usually one of the first cleaning steps, because models for tasks such as classification and sentiment analysis do not need it.
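One common way to strip punctuation in Python uses `str.translate` with the standard library's `string.punctuation`. Note that this constant covers ASCII punctuation only, so typographic quotes or other Unicode marks would need separate handling:

```python
import string

def remove_punctuation(text):
    # str.maketrans with a third argument maps each listed character to None (deletion).
    return text.translate(str.maketrans("", "", string.punctuation))

print(remove_punctuation("Hello, world! Isn't NLP fun?"))
```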

### Removing Stopwords:

Stopwords contribute little to the meaning of the text. They introduce a lot of noise because they appear far more frequently than other words. Some examples of stopwords are given below:

"and", "the", "how", "all", "about", "on", "under", "up", "after", "i", "me", "myself", "we", "our", "ours", "your", "yours", ...

We filter out these stopwords before doing any statistical analysis or creating a model.
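A minimal sketch of this filtering step, using only the example stopwords listed above (real stopword lists, such as the one shipped with NLTK, are considerably longer). The helper name `remove_stopwords` is illustrative:

```python
STOPWORDS = {"and", "the", "how", "all", "about", "on", "under", "up", "after",
             "i", "me", "myself", "we", "our", "ours", "your", "yours"}

def remove_stopwords(tokens):
    # Compare in lowercase so "The" and "the" are treated alike.
    return [token for token in tokens if token.lower() not in STOPWORDS]

tokens = "We walked up the hill and talked about our plans".split()
print(remove_stopwords(tokens))
```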

### Tokenization: 

Tokenization means splitting a sentence, paragraph, phrase, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token. We can consider each word as a token.

Tokenization is significant because the meaning of the text can easily be interpreted by analyzing the words present in the text. We can count the number of words in the text after tokenization. [More information about tokenization methods with Python](https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-perform-tokenization/)
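A simple regex-based sketch of sentence and word tokenization is shown here. The sample text and the patterns are illustrative assumptions; dedicated tokenizers handle abbreviations, contractions, and other edge cases that these two patterns do not.

```python
import re

text = "NLP is fun. Tokenization splits text into tokens! Is it hard?"

# Sentence tokens: split on whitespace that follows ".", "!" or "?".
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word tokens: runs of letters, digits or underscores.
words = re.findall(r"\w+", text)

print(sentences)
print(words)
print("number of words:", len(words))
```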

### Vectorization:

Word vectorization is a methodology in NLP to map words from vocabulary to a corresponding numeric vector. In other words, it is the process of converting words into numbers.


There are several methods to implement vectorization. The best-known ones are listed below:
- Count Vectors (CountVectorizer)
- TF-IDF Vectors (TfidfVectorizer)
- Word Embeddings (Word2Vec, GloVe, etc.)

P.S. Some challenges for NLP:

- **Ambiguity**: A sentence can have more than one reading, e.g. "The chicken is ready to eat" or "I saw her duck".
- **Synonymy**: Synonymy refers to the relationship between words or expressions that have similar or identical meanings, i.e. different words that can be used interchangeably in a particular context without changing the overall meaning of a sentence or phrase. Synonyms are not always interchangeable, though: "big sister" and "large sister" do not mean the same thing (older, or physically bigger?).
- **Polysemy**: A word can have multiple meanings. For example, the meaning of "play" changes with the term that follows it: play a joke, play sports, play a part, play opposite. Likewise, "I saw bats" could refer to the animals or to sports equipment.
- **Coreference**: How pronouns and other expressions refer to one another. Coreference resolution is the process of identifying and connecting expressions in a text that refer to the same entity or concept. It helps us understand the relationships between different mentions of an entity and improves our comprehension of the text. For example, in "John went to the store. He bought some groceries.", "John" and "He" refer to the same person, and coreference resolution links "He" back to "John". A harder case is: "I voted for Nader because he was most aligned with my values," she said. Here the model should determine what "I", "Nader", "he", "my", and "she" each refer to.
- **Cultural and Domain Variations**: Synonym usage can vary across different cultures, regions, and domains. Words that are considered synonyms in one context may not be interchangeable in another. NLP models need to be trained on diverse and representative data to capture these variations and understand the appropriate usage of synonyms in different contexts.

![image.png](attachment:7298d411-7679-4725-9682-a5e3c223a0b4.png)

![image.png](attachment:37da0fa6-d172-4701-afff-8b3dac3f8b6d.png)