# Text Mining (NLP) with Python

# *Introduction*

This notebook contains code examples to get you started with Python for Natural Language Processing (NLP) / Text Mining.  

In the grand scheme of things there are roughly four steps:  

1. Identify a data source  
2. Gather the data  
3. Process the data  
4. Analyze the data  

This notebook only discusses steps 3 and 4. If you want to learn more about step 2, see my [Python tutorial](https://github.com/TiesdeKok/LearnPythonforResearch). 


## Note: companion slides


This notebook was designed to accompany a PhD course session on NLP techniques in Accounting Research.  
The slides of this session are publicly available here: [Slides](http://www.tiesdekok.com/AccountingNLP_Slides/)


# *Elements / topics that are discussed in this notebook:*



<img style="float: left" src="https://i.imgur.com/c3aCZLA.png" width="50%" /> 

# *Table of Contents*  <a id='toc'></a>

* [Primer on NLP tools](#tool_primer)     
* [Process + Clean text](#proc_clean)   
    * [Normalization](#normalization)
        * [Deal with unwanted characters](#unwanted_char)
        * [Sentence segmentation](#sentence_seg)   
        * [Word tokenization](#word_token)
        * [Lemmatization & Stemming](#lem_and_stem) 
    * [Language modeling](#lang_model) 
        * [Part-of-Speech tagging](#pos_tagging) 
        * [Uni-Gram & N-Grams](#n_grams) 
        * [Stop words](#stop_words) 
* [Direct feature extraction](#feature_extract) 
    * [Feature search](#feature_search) 
        * [Entity recognition](#entity_recognition) 
        * [Pattern search](#pattern_search) 
    * [Text evaluation](#text_eval) 
        * [Language](#language) 
        * [Dictionary counting](#dict_counting) 
        * [Readability](#readability) 
* [Represent text numerically](#text_numerical) 
    * [Bag of Words](#bows) 
        * [TF-IDF](#tfidf) 
    * [Word Embeddings](#word_embed) 
        * [Spacy](#spacyEmbedding)
        * [Word2Vec](#Word2Vec) 
* [Statistical models](#stat_models) 
    * ["Traditional" machine learning](#trad_ml) 
        * [Supervised](#trad_ml_supervised) 
            * [Naïve Bayes](#trad_ml_supervised_nb) 
            * [Support Vector Machines (SVM)](#trad_ml_supervised_svm) 
        * [Unsupervised](#trad_ml_unsupervised) 
            * [Latent Dirichlet Allocation (LDA)](#trad_ml_unsupervised_lda) 
            * [pyLDAvis](#trad_ml_unsupervised_pyLDAvis) 
* [Model Selection and Evaluation](#trad_ml_eval) 
* [Neural Networks](#nn_ml)


# <span style="text-decoration: underline;">Primer on NLP tools</span><a id='tool_primer'></a> [(to top)](#toc)

There are many tools available for NLP purposes.  
The code examples below are based on what I personally like to use; they are not intended to be a comprehensive overview.  

Besides built-in Python functionality I will use / demonstrate the following packages:

**Standard NLP libraries**:
1. `Spacy` 
2. `NLTK` and the higher-level wrapper `TextBlob`

*Note: besides installing the above packages you often also have to download (model) data. Make sure to check the documentation!*

**Standard machine learning library**:

1. `scikit-learn`

**Specific task libraries**:

There are many, just a couple of examples:

1. `pyLDAvis` for visualizing LDA
2. `langdetect` for detecting languages
3. `fuzzywuzzy` for fuzzy text matching
4. `Gensim` for topic modelling


# <span style="text-decoration: underline;">Get some example data</span><a id='example_data'></a> [(to top)](#toc)

There are many example datasets available to play around with; see for example this great repository:  
https://archive.ics.uci.edu/ml/datasets.php


The data that I will use for most of the examples is the "Reuter_50_50 Data Set" that is used for author identification experiments. 

See the details here: https://archive.ics.uci.edu/ml/datasets/Reuter_50_50  


### Download and load the data

Can't follow what I am doing here? Please see my [Python tutorial](https://github.com/TiesdeKok/LearnPythonforResearch) (although the `zipfile` and `io` operations are not very relevant).


In [1]:
import requests, zipfile, io, os
from tqdm.notebook import tqdm

*Note:* for `tqdm` to work in JupyterLab you need to install the `@jupyter-widgets/jupyterlab-manager` using the puzzle icon in the left side bar. 

*Download and extract the zip file with the data*

In [2]:
if not os.path.exists('C50test'):
    r = requests.get("https://archive.ics.uci.edu/ml/machine-learning-databases/00217/C50.zip")
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall()

*Load the data into memory*


In [3]:
folder_dict = {'test' : 'C50test'}
text_dict = {'test' : {}}

In [4]:
!pip install -U ipywidgets







In [5]:
for label, folder in tqdm(folder_dict.items()):
    authors = os.listdir(folder)
    for author in authors:
        text_files = os.listdir(os.path.join(folder, author))
        for file in text_files:
            with open(os.path.join(folder, author, file), 'r') as text_file:
                text_dict[label].setdefault(author, []).append(' '.join(text_file.readlines()))

  0%|          | 0/1 [00:00<?, ?it/s]

*Note: the text comes pre-split per sentence; for the sake of example I undo this through `' '.join(text_file.readlines())`.*

In [6]:
text_dict['test']['TimFarrand'][0]

'Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain\'s Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.\n Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers Commission which is due to report before March 24, 1997. The shares fell 6p to 781p on the news.\n "The stock is probably dead in the water until March," said John Wakley, analyst at Lehman Brothers.  \n Dermott Carr, an analyst at Nikko said, "the market is going to hang onto them for the moment but until we get a decision they will be held back."\n Whatever the MMC decides many analysts expect Lang to defer a decision until after the next general election which will be called by May 22.\n "They will probably try to defer the decision until after the election. I don\'t think they want the negative PR of having a large number of people fired," said Wakley.  \n If the deal does not go throu
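The `\n ` sequences visible in the output above come from joining the lines with a space while every line still carries its trailing newline. The following is a minimal sketch of that effect using an in-memory file; the two-sentence sample text is made up for illustration and is not taken from the dataset.

```python
import io

# Hypothetical two-line file with one sentence per line (not from the dataset).
fake_file = io.StringIO("First sentence.\nSecond sentence.\n")
lines = fake_file.readlines()

# Joining with a space keeps each trailing newline, which yields "\n " between sentences.
joined_raw = ' '.join(lines)

# Stripping each line first gives plain single-space separation instead.
joined_clean = ' '.join(line.strip() for line in lines)

print(repr(joined_raw))
print(repr(joined_clean))
```

Whether to strip at load time is a design choice; this notebook keeps the raw form so that the cleaning steps further down have something to clean.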

# <span style="text-decoration: underline;">Process + Clean text</span><a id='proc_clean'></a> [(to top)](#toc)

## Convert the text into an NLP representation

We can use the text directly, but if we want to use packages like `spacy` and `textblob` we first have to convert the text into a corresponding object.  

### Spacy

**Note:** how you load the language model depends on how you installed it. You can load it by name (the `spacy.en` import path from very old spaCy versions no longer exists):

```
import spacy
nlp = spacy.load('en_core_web_md')
```
OR
```
import en_core_web_sm
nlp = en_core_web_sm.load()

import en_core_web_md
nlp = en_core_web_md.load()

import en_core_web_lg
nlp = en_core_web_lg.load()
```

In [8]:
!pip install spacy
!python -m spacy download en_core_web_md







In [9]:
import spacy
import en_core_web_md
nlp = en_core_web_md.load()

Convert all text in the "test" sample to a `spacy` `Doc` object using `nlp.pipe()`:

In [10]:
spacy_text = {}
for author, text_list in tqdm(text_dict['test'].items()):
    spacy_text[author] = list(nlp.pipe(text_list))

  0%|          | 0/50 [00:00<?, ?it/s]

*A note on speed:* this is slow because we didn't disable any components; see this note from the documentation:  
> Only apply the pipeline components you need. Getting predictions from the model that you don’t actually need adds up and becomes very inefficient at scale. To prevent this, use the disable keyword argument to disable components you don’t need – either when loading a model, or during processing with nlp.pipe. See the section on disabling pipeline components for more details and examples. [link](https://spacy.io/usage/processing-pipelines#disabling)


In [11]:
type(spacy_text['TimFarrand'][0])

spacy.tokens.doc.Doc

### NLTK

In [12]:
import nltk

We can apply basic `nltk` operations directly to the text so we don't need to convert first. 

### TextBlob

In [13]:
from textblob import TextBlob

Convert all text in the "test" sample to a `TextBlob` object using `TextBlob()`:

In [14]:
textblob_text = {}
for author, text_list in text_dict['test'].items():
    textblob_text[author] = [TextBlob(text) for text in text_list]

In [15]:
type(textblob_text['TimFarrand'][0])

textblob.blob.TextBlob

## <span style="text-decoration: underline;">Normalization</span><a id='normalization'></a> [(to top)](#toc)

**Text normalization** describes the task of transforming the text into a different (more comparable) form.  

This can mean many things; I will show a couple of options below:

### <span style="text-decoration: underline;">Deal with unwanted characters</span><a id='unwanted_char'></a> [(to top)](#toc)

You will often notice that there are characters that you don't want in your text.  

Let's look at this sentence for example:

> "Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain\'s Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.\n Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers"

Notice the `\'` and `\n` sequences in there. These are escape sequences: `\'` is an escaped apostrophe and `\n` is a newline character. They show up in the raw representation of the string; if we print the text we get:  

In [16]:
text_dict['test']['TimFarrand'][0][:298]

"Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.\n Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers"

In [17]:
print(text_dict['test']['TimFarrand'][0][:298])

Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.
 Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers


These special characters can cause problems in our analyses (and can be hard to debug if you are using `print` statements to inspect the data).
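Because `print` renders these characters instead of showing them, `repr()` is often the easier way to spot them. Here is a small sketch with a made-up sample string (an assumption for illustration, not part of the dataset):

```python
# Hypothetical sample containing a newline and a tab.
sample = 'said analysts.\n Earlier Lang announced\tthe deal'

# repr() shows escape sequences explicitly, whereas print(sample) would render them.
visible = repr(sample)
print(visible)

# Collect the whitespace characters that are not plain spaces.
hidden = [ch for ch in sample if ch.isspace() and ch != ' ']
print(hidden)
```

Checking for non-space whitespace like this is a quick sanity test before and after any cleaning step.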

**So how do we remove them?**

In many cases it is sufficient to simply use the `.replace()` method:

In [18]:
text_dict['test']['TimFarrand'][0][:298].replace('\n', '').replace('\\', '')

"Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts. Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers"

Sometimes, however, the problem arises from encoding / decoding issues.  

In those cases you can usually do something like:  

In [19]:
problem_sentence = 'This is some \u03c0 text that has to be cleaned\u2026! it\u0027s difficult to deal with!'
print(problem_sentence)
print(problem_sentence.encode().decode('unicode_escape').encode('ascii','ignore'))

This is some π text that has to be cleaned…! it's difficult to deal with!
b"This is some  text that has to be cleaned! it's difficult to deal with!"
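The approach above drops every non-ASCII character outright. As a gentler variant, the standard library's `unicodedata` module can decompose characters first, so that only the accents are dropped. The helper name `strip_accents` and the sample strings below are my own illustration, not part of the original notebook.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

# Accented letters keep their base letter.
print(strip_accents('Café Zürich'))

# The single-character ellipsis (U+2026) decomposes into three dots.
print(strip_accents('cleaned\u2026'))

# Characters without a decomposition, such as the Greek pi, are left untouched.
print(strip_accents('\u03c0'))
```

Note that this does not transliterate: letters from non-Latin scripts stay as they are, so the result is not guaranteed to be ASCII.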


An alternative that transliterates non-ASCII characters to their closest ASCII equivalent, instead of dropping them, is `unidecode`:

In [21]:
!pip install unidecode



In [22]:
import unidecode

In [23]:
print('\u738b\u7389')

王玉


In [24]:
unidecode.unidecode(u"\u738b\u7389")

'Wang Yu '

In [25]:
unidecode.unidecode(problem_sentence)

"This is some p text that has to be cleaned...! it's difficult to deal with!"

### <span style="text-decoration: underline;">Sentence segmentation</span><a id='sentence_seg'></a> [(to top)](#toc)

Sentence segmentation refers to the task of splitting up the text by sentence.  

You could do this by splitting on the `.` character, but dots are used in many other places as well (decimals, abbreviations), so this is not very robust:

In [26]:
text_dict['test']['TimFarrand'][0][:550].split('.')

["Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts",
 '\n Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers Commission which is due to report before March 24, 1997',
 ' The shares fell 6p to 781p on the news',
 '\n "The stock is probably dead in the water until March," said John Wakley, analyst at Lehman Brothers',
 '  \n Dermott Carr, an analyst at Nikko said, "the mark']
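To make the weakness of splitting on `.` concrete, here is a sketch that compares it with a slightly stricter regular expression. The sample text and the pattern are assumptions chosen for illustration; the pattern still breaks on abbreviations such as "Mr. Smith", which is why a trained segmenter is preferable.

```python
import re

# Made-up sample containing a decimal number and a quoted sentence.
text = ('Shares of Bass Plc fell 1.5 percent. '
        '"It is dead in the water," said an analyst. '
        'The MMC reports in March.')

# Naive approach: every dot is a boundary, so "1.5" is cut in half.
naive = [part.strip() for part in text.split('.') if part.strip()]

# Stricter rule: split only on whitespace that follows ., ! or ? and
# precedes an upper-case letter or an opening double quote.
pattern = re.compile(r'(?<=[.!?])\s+(?=["A-Z])')
regex_based = pattern.split(text)

print(naive)
print(regex_based)
```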

It is better to use a more sophisticated implementation, such as the one in `spaCy`:

In [27]:
example_paragraph = spacy_text['TimFarrand'][0]

In [28]:
sentence_list = [s for s in example_paragraph.sents]
sentence_list[:5]

[Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.
  ,
 Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers Commission which is due to report before March 24, 1997.,
 The shares fell 6p to 781p on the news.
  ,
 "The stock is probably dead in the water until March," said John Wakley, analyst at Lehman Brothers.  
  ,
 Dermott Carr, an analyst at Nikko said, "the market is going to hang onto them for the moment but until we get a decision they will be held back."
  Whatever the MMC decides many analysts expect Lang to defer a decision until after the next general election which will be called by May 22.
  ]

Notice that each returned sentence is still a `spacy` object (a `Span`):

In [29]:
type(sentence_list[0])

spacy.tokens.span.Span

*Note:* `spacy` sentence segmentation relies on the text being capitalized, so make sure you did not convert it to lower case before running this operation.

Apply to all texts (for use later on):

In [30]:
spacy_sentences = {}
for author, text_list in tqdm(spacy_text.items()):
    spacy_sentences[author] = [list(text.sents) for text in text_list]

  0%|          | 0/50 [00:00<?, ?it/s]

In [31]:
spacy_sentences['TimFarrand'][0][:3]

[Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.
  ,
 Earlier Lang announced the Bass deal would be referred to the Monoplies and Mergers Commission which is due to report before March 24, 1997.,
 The shares fell 6p to 781p on the news.
  ]

### <span style="text-decoration: underline;">Word tokenization</span><a id='word_token'></a> [(to top)](#toc)

Word tokenization means splitting the sentence (or text) up into words.

In [32]:
example_sentence = spacy_sentences['TimFarrand'][0][0]
example_sentence

Shares in brewing-to-leisure group Bass Plc are likely to be held back until Britain's Trade and Industry secretary Ian Lang decides whether to allow its proposed merge with brewer Carlsberg-Tetley, said analysts.
 

In this context a word is called a `token` (hence `tokenization`). Using `spacy`:

In [33]:
token_list = [token for token in example_sentence]
token_list[0:15]

[Shares,
 in,
 brewing,
 -,
 to,
 -,
 leisure,
 group,
 Bass,
 Plc,
 are,
 likely,
 to,
 be,
 held]
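For comparison, a rough tokenizer can be written with the standard library alone. This is only a sketch under a simple assumption (tokens are either runs of word characters or single punctuation marks); the function name `simple_tokenize` is hypothetical, and it is not how `spacy` tokenizes internally.

```python
import re

def simple_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Shares in brewing-to-leisure group Bass Plc."))

# Contractions and possessives are split more crudely than a trained tokenizer would.
print(simple_tokenize("Britain's"))
```

The possessive example shows where such a rule falls short: the apostrophe becomes its own token, which loses the link between `'` and `s`.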

### <span style="text-decoration: underline;">Lemmatization & Stemming</span><a id='lem_and_stem'></a> [(to top)](#toc)

In some cases you want to convert a word (i.e. token) into a more general representation.  

For example: convert "car", "cars", "car's", "cars'" all into the word `car`.

This is generally done through lemmatization / stemming (different approaches trying to achieve a similar goal).  


**Spacy**

spaCy offers built-in functionality for lemmatization:

In [34]:
lemmatized = [token.lemma_ for token in example_sentence]
lemmatized[0:15]

['share',
 'in',
 'brewing',
 '-',
 'to',
 '-',
 'leisure',
 'group',
 'Bass',
 'Plc',
 'be',
 'likely',
 'to',
 'be',
 'hold']

**NLTK**

Using the NLTK library we can also use the more aggressive Porter stemmer:

In [35]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

In [36]:
stemmed = [stemmer.stem(token.text) for token in example_sentence]
stemmed[0:15]

['share',
 'in',
 'brew',
 '-',
 'to',
 '-',
 'leisur',
 'group',
 'bass',
 'plc',
 'are',
 'like',
 'to',
 'be',
 'held']

**Compare**:

In [37]:
print('  Original  | Spacy Lemma  | NLTK Stemmer')
print('-' * 41)
for original, lemma, stem in zip(token_list[:15], lemmatized[:15], stemmed[:15]):
    print(str(original).rjust(10, ' '), ' | ', str(lemma).rjust(10, ' '), ' | ', str(stem).rjust(10, ' '))

  Original  | Spacy Lemma  | NLTK Stemmer
-----------------------------------------
    Shares  |       share  |       share
        in  |          in  |          in
   brewing  |     brewing  |        brew
         -  |           -  |           -
        to  |          to  |          to
         -  |           -  |           -
   leisure  |     leisure  |      leisur
     group  |       group  |       group
      Bass  |        Bass  |        bass
       Plc  |         Plc  |         plc
       are  |          be  |         are
    likely  |      likely  |        like
        to  |          to  |          to
        be  |          be  |          be
      held  |        hold  |        held
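The difference in the table comes from how the two approaches work: a stemmer strips suffixes by rule, while a lemmatizer maps a word to its dictionary form. The toy sketch below illustrates that contrast only; the suffix list and lookup table are hypothetical and far smaller than what `nltk` or `spacy` actually use, so its output differs from theirs.

```python
# Hypothetical, tiny rule set and lookup table for illustration only.
SUFFIXES = ('ing', 'ly', 'es', 's')
LEMMA_LOOKUP = {'are': 'be', 'held': 'hold', 'shares': 'share'}

def toy_stem(word):
    """Strip the first matching suffix without checking that the result is a word."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a dictionary and fall back to the lower-cased word."""
    return LEMMA_LOOKUP.get(word.lower(), word.lower())

for word in ['Shares', 'brewing', 'are', 'held']:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The rule-based function can produce non-words and cannot handle irregular forms such as "held", whereas the lookup handles them but only for words it knows.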


In my experience it is usually best to use lemmatization instead of a stemmer. 


## <span style="text-decoration: underline;">Language modeling</span><a id='lang_model'></a> [(to top)](#toc)

Text is inherently structured in complex ways, and we can often exploit some of this underlying structure. 

### <span style="text-decoration: underline;">Part-of-Speech tagging</span><a id='pos_tagging'></a> [(to top)](#toc)

Part-of-speech tagging refers to labeling each word as a noun, verb, adjective, etc. 

Using `Spacy`:

In [38]:
pos_list = [(token, token.pos_) for token in example_sentence]
pos_list[0:10]

[(Shares, 'NOUN'),
 (in, 'ADP'),
 (brewing, 'NOUN'),
 (-, 'PUNCT'),
 (to, 'ADP'),
 (-, 'PUNCT'),
 (leisure, 'NOUN'),
 (group, 'NOUN'),
 (Bass, 'PROPN'),
 (Plc, 'PROPN')]

### <span style="text-decoration: underline;">Uni-Gram & N-Grams</span><a id='n_grams'></a> [(to top)](#toc)

A sentence is not a random collection of words: the order of the words carries information.  

A simple way to capture some of this order is to use `n-grams`.  
An `n-gram` is nothing more than a combination of `N` consecutive words into one token (a uni-gram token is just one word).  

So we can convert `"Sentence about flying cars"` into a list of bigrams:

> Sentence-about, about-flying, flying-cars  

See my slide on N-Grams for a more comprehensive example: [click here](http://www.tiesdekok.com/AccountingNLP_Slides/#14)
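The bigram conversion of the example sentence can be sketched in plain Python. The helper name `ngrams` is my own; it assumes the text has already been split into tokens.

```python
def ngrams(tokens, n, sep='-'):
    """Join every run of n consecutive tokens into one n-gram token."""
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'Sentence about flying cars'.split()

print(ngrams(tokens, 1))  # uni-grams: the words themselves
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

A sentence with `k` tokens yields `k - n + 1` n-grams, so asking for a larger `n` than there are tokens simply returns an empty list.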


Using `NLTK`:

In [39]:
bigram_list = ['-'.join(x) for x in nltk.bigrams([token.text for token in example_sentence])]
bigram_list[10:15]

['are-likely', 'likely-to', 'to-be', 'be-held', 'held-back']

Using `spacy`

In [40]:
def tokenize_without_punctuation(sen_obj):
    return [token.text for token in sen_obj if token.is_alpha]

In [41]:
def create_ngram(sen_obj, n, sep = '-'):
    token_list = tokenize_without_punctuation(sen_obj)
    ngram_list = []
    # Slide a window of n tokens over the sentence. Using range() also works for
    # n = 1, where the slice token_list[:-n+1] would be empty.
    for i in range(len(token_list) - n + 1):
        ngram_item = token_list[i:i + n]
        ngram_list.append(sep.join(ngram_item))
    return ngram_list

In [42]:
create_ngram(example_sentence, 2)[:5]

['Shares-in', 'in-brewing', 'brewing-to', 'to-leisure', 'leisure-group']

In [43]:
create_ngram(example_sentence, 3)[:5]

['Shares-in-brewing',
 'in-brewing-to',
 'brewing-to-leisure',
 'to-leisure-group',
 'leisure-group-Bass']

### <span style="text-decoration: underline;">Stop words</span><a id='stop_words'></a> [(to top)](#toc)

Depending on what you are trying to do, many words may not add any information value to the sentence.  

The primary example is stop words: very frequent function words such as "the", "in", or "to".  

Sometimes you can improve the accuracy of your model by removing stop words.

Using `Spacy`:

In [44]:
no_stop_words = [token for token in example_sentence if not token.is_stop]

In [45]:
no_stop_words[:10]

[Shares, brewing, -, -, leisure, group, Bass, Plc, likely, held]

In [46]:
token_list[:10]

[Shares, in, brewing, -, to, -, leisure, group, Bass, Plc]

*Note* we can also remove punctuation in the same way:

*Not* noktalama işaretlerini de aynı şekilde kaldırabiliriz:

In [47]:
[token for token in example_sentence if not token.is_stop and token.is_alpha][:10]

[Shares, brewing, leisure, group, Bass, Plc, likely, held, Britain, Trade]

## Wrap everything into one function

## Her şeyi tek bir işleve toplayın

**Basic SpaCy text processing function**

**Temel SpaCy metin işleme işlevi**

1. Split into sentences
2. Apply the lemmatizer, remove stop words, remove punctuation
3. Join the remaining lemmas into one cleaned string per sentence

In [48]:
def process_text_custom(text):
    # Disable pipeline components that are not needed here to speed up processing
    sentences = list(nlp(text, disable=['tagger', 'ner', 'entity_linker', 'textcat', 'entity_ruler']).sents)
    lemmatized_sentences = []
    for sentence in sentences:
        # Keep the lemma of every alphabetic token that is not a stop word
        lemmatized_sentences.append([token.lemma_ for token in sentence if not token.is_stop and token.is_alpha])
    return [' '.join(sentence) for sentence in lemmatized_sentences]

In [49]:
spacy_text_clean = {}
for author, text_list in tqdm(text_dict['test'].items()):
    lst = []
    for text in text_list:
        lst.append(process_text_custom(text))
    spacy_text_clean[author] = lst




*Note:* this would take quite a long time if we didn't disable some of the pipeline components. 

*Not:* bazı bileşenleri devre dışı bırakmasaydık bu işlem oldukça uzun sürerdi.

In [52]:
count = 0
for author, texts in spacy_text_clean.items():
    for text in texts:
        count += len(text)
print('Number of sentences/cümle sayısı:', count)

Number of sentences: 53431


Result

In [53]:
spacy_text_clean['TimFarrand'][0][:3]

['shares brewing leisure group bass plc likely held britain trade industry secretary ian lang decides allow proposed merge brewer carlsberg tetley said analysts',
 'earlier lang announced bass deal referred monoplies mergers commission report march',
 'shares fell news']

*Note:* the quality of the input text is not great, so the sentence segmentation is also not great (without further tweaking).

*Not:* giriş metninin kalitesi çok iyi değil, bu nedenle cümle bölümleri de harika değil (daha fazla ince ayar yapmadan).

# <span style="text-decoration: underline;">Direct feature extraction/Doğrudan özellik çıkarma</span><a id='feature_extract'></a> [(to top)](#toc)

We have now pre-processed our text into something that we can use for direct feature extraction or convert to a numerical representation. 

Artık metnimizi, doğrudan özellik çıkarma veya sayısal bir temsile dönüştürmek için kullanabileceğimiz bir şeye önceden işledik.

## <span style="text-decoration: underline;">Feature search/Özellik arama</span><a id='feature_search'></a> [(to top)](#toc)

### <span style="text-decoration: underline;">Entity recognition/Varlık tanıma</span><a id='entity_recognition'></a> [(to top)](#toc)

It is often useful / relevant to extract entities that are mentioned in a piece of text.   

SpaCy is quite powerful in extracting entities, however, it doesn't work very well on lowercase text.  

Given that "token.lemma\_" removes capitalization I will use `spacy_sentences` for this example.

#### --------------------------------------------------------

Bir metin parçasında bahsedilen varlıkları çıkarmak genellikle yararlıdır/ilgilidir.

SpaCy, varlıkları çıkarmada oldukça güçlüdür, ancak küçük harfli metinlerde pek iyi çalışmaz.

"token.lemma\_" büyük harf kullanımını kaldırdığı için bu örnek için "spacy_sentences" kullanacağım.

In [54]:
example_sentence = spacy_sentences['TimFarrand'][0][3]
example_sentence

"The stock is probably dead in the water until March," said John Wakley, analyst at Lehman Brothers.  
 

In [55]:
[(i, i.label_) for i in nlp(example_sentence.text).ents]

[(March, 'DATE'), (John Wakley, 'PERSON'), (Lehman Brothers, 'ORG')]

In [56]:
example_sentence = spacy_sentences['TimFarrand'][4][0]
example_sentence

British pub-to-hotel group Greenalls Plc on Thursday reported a 48 percent rise in profits before exceptional items to 148.7 million pounds ($246.4 million), driven by its acquisition of brewer Boddington in November 1995.
 

In [57]:
[(i, i.label_) for i in nlp(example_sentence.text).ents]

[(British, 'NORP'),
 (Greenalls Plc, 'ORG'),
 (Thursday, 'DATE'),
 (48 percent, 'PERCENT'),
 (148.7 million pounds, 'MONEY'),
 ($246.4 million, 'MONEY'),
 (Boddington, 'GPE'),
 (November 1995, 'DATE')]

### <span style="text-decoration: underline;">Pattern search/Desen arama</span><a id='pattern_search'></a> [(to top)](#toc)

Using the built-in `re` (regular expression) library you can pattern match nearly anything you want.  

I will not go into details about regular expressions but see here for a tutorial:  
https://regexone.com/references/python  

### ----------------------------------

Yerleşik "re" (düzenli ifade) kitaplığını kullanarak, neredeyse istediğiniz her şeyi modelle eşleştirebilirsiniz.

Düzenli ifadeler hakkında ayrıntılara girmeyeceğim, ancak bir eğitim için buraya bakın:
https://regexone.com/references/python

In [58]:
import re

**TIP**: Use [Pythex.org](https://pythex.org/) to try out your regular expression

Example on Pythex: <a href="https://pythex.org/?regex=IDNUMBER: (\d\d\d-\w\w)&test_string=Ties de Kok (IDNUMBER: 123-AZ). Rest of Text." target='_blank'>click here</a>

##### ---------------------------------------

**İPUCU**: Normal ifadenizi denemek için [Pythex.org](https://pythex.org/) kullanın

Pythex ile ilgili örnek: <a href="https://pythex.org/?regex=IDNUMBER: (\d\d\d-\w\w)&test_string=Ties de Kok (IDNUMBER: 123-AZ). Metin." target='_blank'>burayı tıklayın</a>

### Example 1:

In [59]:
string_1 = 'Ties de Kok (#IDNUMBER: 123-AZ). Rest of text...'
string_2 = 'Philip Joos (#IDNUMBER: 663-BY). Rest of text...'

In [60]:
pattern = r'#IDNUMBER: (\d\d\d-\w\w)'

In [61]:
print(re.findall(pattern, string_1)[0])
print(re.findall(pattern, string_2)[0])

123-AZ
663-BY
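
When a pattern captures more than one piece of information, named groups keep the result readable. The sketch below extends the ID example; the helper `parse_id` and the group names `name`, `digits` and `letters` are my own choices for illustration.

```python
import re

# Compile once when the same pattern is applied to many strings
id_pattern = re.compile(r"(?P<name>.+?) \(#IDNUMBER: (?P<digits>\d{3})-(?P<letters>\w{2})\)")

def parse_id(text):
    """Return a dict with name, digits and letters, or None if there is no match."""
    match = id_pattern.search(text)
    return match.groupdict() if match else None

print(parse_id("Ties de Kok (#IDNUMBER: 123-AZ). Rest of text..."))
print(parse_id("No identifier here"))
```

Returning `None` for a non-match avoids the `IndexError` that `re.findall(...)[0]` raises when nothing is found.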


### Example 2:

Print every sentence that contains the word 'million' (case-insensitive):

In [62]:
TERM = 'million'
for sen in spacy_text_clean['TimFarrand'][2]:
    # re.escape keeps the search literal if TERM ever contains regex metacharacters
    if re.search(re.escape(TERM), sen, flags=re.IGNORECASE):
        print(sen)

analysts forecasts pretax profits range million stg restructuring costs million time
restructuring cost million anticipated bulk million stemming closure smaller production plant france
cadbury drinks business turn million stg trading profit million half entirely contribution dr pepper
campbell estimates uk beverages contribute million stg operating profit million time
broadly analysts expect pretty flat performance group confectionery business consensus forecast million stg operating profits
average analysts calculate beverages chip trading profits million
sale percent stake coca cola amp schweppes beverages ccsb operations coca cola enterprises june million stg analysts want clear statement strategy company
far analysts company said shareholders expect return investments emerging markets largest far million russian plant
cadbury announced investment million stg building new plant wrocoaw poland joint venture china cost million
net debt billion end fall million end result ccsb sale pr

## <span style="text-decoration: underline;">Text evaluation/Metin değerlendirmesi</span><a id='text_eval'></a> [(to top)](#toc)

Besides feature search there are also many ways to analyze the text as a whole.  

Let's, for example, evaluate the following paragraph:

### ---------------------------------------

Özellik aramanın yanı sıra, metni bir bütün olarak analiz etmenin birçok yolu vardır.

Örneğin aşağıdaki paragrafı değerlendirelim:

In [63]:
example_paragraph = ' '.join([x for x in spacy_text_clean['TimFarrand'][2]])
example_paragraph[:500]

'soft drinks confectionery group cadbury schweppes plc expected report solid percent rise half profits wednesday faces questions performance soft drink main questions success relaunch brand said mark duffy food manufacturing analyst sbc warburg competitor sprite owned coca cola seen agressive marketing push ranked fastest growing brand cadbury dr pepper analysts forecasts pretax profits range million stg restructuring costs million time dividend pence expected restructuring cost million anticipat'

### <span style="text-decoration: underline;">Language/Dil</span><a id='language'></a> [(to top)](#toc)

Using the `spacy-langdetect` package it is easy to detect the language of a piece of text

Spacy-langdetect paketini kullanarak bir metin parçasının dilini tespit etmek kolaydır

In [66]:
!pip install spacy_langdetect


In [70]:
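from spacy.language import Language
from spacy_langdetect import LanguageDetector

# spaCy v3 requires custom components to be registered under a name
@Language.factory('language_detector')
def create_language_detector(nlp, name):
    return LanguageDetector()

nlp.add_pipe('language_detector')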


<spacy_langdetect.spacy_langdetect.LanguageDetector at 0x2271582dab0>

In [71]:
print(nlp(example_paragraph)._.language)

{'language': 'en', 'score': 0.9999968207324081}


### <span style="text-decoration: underline;">Readability/Okunabilirlik</span><a id='readability'></a> [(to top)](#toc)

In general I recommend calculating readability metrics yourself, as they tend not to be difficult to compute. However, there are packages that can help, such as `spacy_readability`.

In [79]:
!pip install --user spacy

!pip install --user spacy-readability









In [80]:
from spacy_readability import Readability

In [85]:
import spacy
from spacy_readability import Readability

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(Readability())


*Note:* in the recorded run this cell failed with `OSError: [E050] Can't find model 'en_core_web_sm'`; the model has to be installed first (`python -m spacy download en_core_web_sm`). Also note that spaCy v3 expects `nlp.add_pipe` to receive the name of a registered component rather than an instance, so `Readability` has to be wrapped in a `@Language.factory`, as was done for the language detector.

In [86]:
doc = nlp("I am some really difficult text to read because I use obnoxiously large words.")
print(doc._.flesch_kincaid_grade_level)
print(doc._.smog)

8.412857142857145
0


**Manual example:** FOG index

In [87]:
import syllapy

In [88]:
def calculate_fog(document):
    doc = nlp(document, disable=['tagger', 'ner', 'entity_linker', 'textcat', 'entity_ruler'])
    sen_list = list(doc.sents)
    num_sen = len(sen_list)

    num_words = 0
    num_complex_words = 0
    for sen_obj in sen_list:
        words_in_sen = [token.text for token in sen_obj if token.is_alpha]
        num_words += len(words_in_sen)
        num_complex  = 0
        for word in words_in_sen:
            num_syl = syllapy.count(word.lower())
            if num_syl > 2:
                num_complex += 1
        num_complex_words += num_complex
        
    fog = 0.4 * ((num_words / num_sen) + ((num_complex_words / num_words)*100))
    return {'fog' : fog, 
            'num_sen' : num_sen, 
            'num_words' : num_words, 
            'num_complex_words' : num_complex_words}

In [89]:
calculate_fog(example_paragraph)



{'fog': 149.31494252873563,
 'num_sen': 1,
 'num_words': 348,
 'num_complex_words': 88}
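
Note that `num_sen` is 1 in the output above: `example_paragraph` was built from cleaned text without punctuation, so the whole paragraph is treated as a single sentence and the resulting FOG value is not meaningful. The index is best computed on raw text. As a rough, dependency-free sketch of the same formula, the version below splits sentences on `.`, `!` and `?` and estimates syllables by counting vowel groups. Both heuristics, and the names `estimate_syllables` and `simple_fog`, are simplifying assumptions and will be less accurate than spaCy and `syllapy`.

```python
import re

def estimate_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def simple_fog(text):
    """Gunning FOG: 0.4 * (words per sentence + 100 * share of complex words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return None  # avoid dividing by zero on empty input
    # "complex" words are those with more than two (estimated) syllables
    complex_words = [w for w in words if estimate_syllables(w) > 2]
    return 0.4 * (len(words) / len(sentences) + 100 * len(complex_words) / len(words))

print(simple_fog("The cat sat. The dog ran."))
print(simple_fog("Information is beautiful."))
```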

## Text similarity/Metin benzerliği

### Using `fuzzywuzzy`

In [91]:
!pip install fuzzywuzzy



In [92]:
from fuzzywuzzy import fuzz



In [93]:
fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")

91
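
For intuition about what such a character-level score measures, a comparable ratio can be computed with the standard library's `difflib.SequenceMatcher`. This is a sketch for illustration only: the helper name `simple_ratio` is an assumption, and its scores are not guaranteed to equal those returned by `fuzzywuzzy`.

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    """Similarity between 0 and 100 based on matching character blocks."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

print(simple_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))
print(simple_ratio("fuzzy wuzzy was a bear", "completely different"))
```

Scores of this kind only compare characters, so they are useful for catching typos and near-duplicates, not for comparing meaning.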

### Using `spacy`

spaCy can provide a similarity score based on semantic similarity ([link](https://spacy.io/usage/vectors-similarity))

Spacy, anlamsal benzerliğe dayalı olarak bir benzerlik puanı sağlayabilir ([link](https://spacy.io/usage/vectors-similarity))

In [94]:
tokens_1 = nlp("fuzzy wuzzy was a bear")
tokens_2 = nlp("wuzzy fuzzy was a bear")

tokens_1.similarity(tokens_2)

0.999999967579795

In [95]:
tokens_1 = nlp("Tom believes German cars are the best.")
tokens_2 = nlp("Sarah recently mentioned that she would like to go on holiday to Germany.")

tokens_1.similarity(tokens_2)

0.5968928111786671
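
By default, spaCy's `Doc.similarity` is the cosine similarity between the two documents' vectors, where a document vector is the average of its token vectors. The sketch below shows that computation on plain Python lists; the three-dimensional vectors are made-up toy values, not real embeddings, and the helper names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for all-zero vectors (e.g. out-of-vocabulary words)
    return dot / (norm_u * norm_v)

def average_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Toy "word vectors" (made up for illustration): same words, different order
doc_a = average_vector([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
doc_b = average_vector([[0.0, 1.0, 1.0], [1.0, 0.0, 1.0]])
print(cosine_similarity(doc_a, doc_b))
```

Because averaging ignores word order, two documents that contain the same words in a different order end up with (almost exactly) the same vector, which is consistent with the near-1 score in the first spaCy example.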

### <span style="text-decoration: underline;">Term (dictionary) counting/Terim (sözlük) sayımı</span><a id='dict_counting'></a> [(to top)](#toc)

A common technique for basic NLP insights is to create simple metrics based on term counts. 

These are relatively easy to implement.

#### -----------------------------------

Temel NLP içgörüleri için yaygın bir teknik, terim sayımlarına dayalı basit ölçümler oluşturmaktır.

Bunların uygulanması nispeten kolaydır.

### Example 1:

In [96]:
word_dictionary = ['soft', 'first', 'most', 'be']

In [97]:
for word in word_dictionary:
    print(word, example_paragraph.count(word))

soft 3
first 0
most 0
be 8
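
Keep in mind that `str.count` counts substrings, so a short term such as 'be' is also counted inside longer words such as 'beverages'. A minimal sketch of whole-word counting with a regular-expression word boundary follows; the helper name `count_whole_word` and the sample text are assumptions for illustration.

```python
import re

def count_whole_word(text, word):
    """Count occurrences of word as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

sample = "soft drinks and beverages will be sold; soft drink sales may be flat"
for word in ["soft", "be"]:
    # term, substring count, whole-word count
    print(word, sample.count(word), count_whole_word(sample, word))
```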


### Example 2:

In [98]:
pos = ['great', 'agree', 'increase']
neg = ['bad', 'disagree', 'decrease']

sentence = '''According to the president everything is great, great, 
and great even though some people might disagree with those statements.'''

pos_count = 0
for word in pos:
    pos_count += sentence.lower().count(word)
print(pos_count)

neg_count = 0
for word in neg:
    neg_count += sentence.lower().count(word)
print(neg_count)

pos_count / (neg_count + pos_count)

4
1


0.8

Getting the total number of words is also easy:

Toplam kelime sayısını elde etmek de kolaydır:

In [99]:
num_tokens = len([token for token in nlp(sentence) if token.is_alpha])
num_tokens

19

### Example 3:

We can also save the count per word

Kelime başına sayımı da kaydedebiliriz

In [100]:
pos_count_dict = {}
for word in pos:
    pos_count_dict[word] = sentence.lower().count(word)

In [101]:
pos_count_dict

{'great': 3, 'agree': 1, 'increase': 0}

*Note:* calling `.lower()` repeatedly on the same text is wasteful; if you have a lot of words / sentences, it is recommended to lowercase each text once and reuse the result.
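
One way to follow this advice, sketched below, is to lowercase and tokenize the text once, count all tokens with `collections.Counter`, and then look up each dictionary term. Matching whole tokens also avoids counting 'agree' inside 'disagree', which is what happens with `str.count` in Example 2. The regular-expression tokenizer is a simplifying assumption.

```python
import re
from collections import Counter

pos = ['great', 'agree', 'increase']
neg = ['bad', 'disagree', 'decrease']

sentence = '''According to the president everything is great, great, 
and great even though some people might disagree with those statements.'''

# Lowercase and tokenize once, then reuse the counts for every lookup
token_counts = Counter(re.findall(r"[a-z]+", sentence.lower()))

pos_count = sum(token_counts[word] for word in pos)
neg_count = sum(token_counts[word] for word in neg)
print(pos_count, neg_count, pos_count / (pos_count + neg_count))
```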

# <span style="text-decoration: underline;">Represent text numerically/Metni sayısal olarak temsil etme</span><a id='text_numerical'></a> [(to top)](#toc)

## <span style="text-decoration: underline;">Bag of Words</span><a id='bows'></a> [(to top)](#toc)

Sklearn includes the `CountVectorizer` and `TfidfVectorizer` classes.  

For details, see the documentation:  
[TF](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)  
[TFIDF](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer)

*Note 1:* these classes also provide a variety of built-in preprocessing options (e.g. n-grams, stop word removal, accent stripping).

*Note 2:* example based on the following website [click here](http://ethen8181.github.io/machine-learning/clustering_old/tf_idf/tf_idf.html)

### ----------------------------------------------------------------------

Sklearn, "CountVectorizer" ve "TfidfVectorizer" işlevini içerir.

Ayrıntılar için belgelere bakın:
[TF](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)
[TFIDF](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer)

*Not 1:* bu işlevler, çeşitli yerleşik ön işleme seçenekleri de sağlar (ör. ngramlar, durdurma sözcüklerini kaldır, aksan sıyırıcı).

*Not 2:* aşağıdaki web sitesine dayanan örnek [burayı tıklayın](http://ethen8181.github.io/machine-learning/clustering_old/tf_idf/tf_idf.html)

In [122]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

### <span style="text-decoration: underline;">TF-IDF</span><a id='tfidf'></a> [(to top)](#toc)

In [106]:
transformer = TfidfVectorizer(stop_words='english')
tfidf = transformer.fit_transform([doc_1, doc_2, doc_3, doc_4])

In [137]:
for doc_vector in tfidf.toarray():
    print(doc_vector)

[0.78528828 0.         0.         0.6191303  0.         0.        ]
[0.         0.47380449 0.         0.         0.47380449 0.74230628]
[0.         0.53256952 0.         0.65782931 0.53256952 0.        ]
[0.         0.36626037 0.57381765 0.         0.73252075 0.        ]
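
To see where such numbers come from, the sketch below recomputes TF-IDF by hand using the formulas that scikit-learn documents for `TfidfVectorizer` with default settings: raw term counts, a smoothed inverse document frequency `idf = ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The two toy documents are made up for illustration and are not the documents used above; `tfidf_rows` is a hypothetical helper name.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights per document."""
    vocab = sorted({term for doc in tokenized_docs for term in doc})
    n_docs = len(tokenized_docs)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights = [counts[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0  # guard empty documents
        rows.append([w / norm for w in weights])
    return vocab, rows

docs = [["sky", "blue"], ["sun", "bright", "sun"]]
vocab, rows = tfidf_rows(docs)
print(vocab)
for row in rows:
    print([round(w, 4) for w in row])
```

Each row has unit length after normalization, which is why the values in a row never exceed 1.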


### More elaborate example:
### Daha ayrıntılı örnek:

In [138]:
clean_paragraphs = []
for author, value in spacy_text_clean.items():
    for article in value:
        clean_paragraphs.append(' '.join([x for x in article]))

In [139]:
len(clean_paragraphs)

2500

In [140]:
transformer = TfidfVectorizer(stop_words='english')
tfidf_large = transformer.fit_transform(clean_paragraphs)

In [141]:
n_docs, n_terms = tfidf_large.shape  # avoids converting the sparse matrix to a dense array
print('Number of vectors/vektör sayısı:', n_docs)
print('Number of words in dictionary/Sözlükteki kelime sayısı:', n_terms)

Number of vectors: 2500
Number of words in dictionary: 27743


In [142]:
tfidf_large

<2500x27743 sparse matrix of type '<class 'numpy.float64'>'
	with 446636 stored elements in Compressed Sparse Row format>

## <span style="text-decoration: underline;">Word Embeddings</span><a id='word_embed'></a> [(to top)](#toc)

### <span style="text-decoration: underline;">Spacy</span><a id='spacyEmbedding'></a> [(to top)](#toc)

The `en_core_web_lg` language model comes with GloVe vectors trained on the Common Crawl dataset ([link](https://spacy.io/models/en#en_core_web_lg))

`en_core_web_lg` dil modeli, Common Crawl veri setinde eğitilmiş GloVe vektörleriyle birlikte gelir ([link](https://spacy.io/models/en#en_core_web_lg))

In [143]:
tokens = nlp("The Dutch word for peanut butter is 'pindakaas', did you know that? This is a typpo.")

for token in tokens:
    if token.is_alpha:
        print(token.text, token.has_vector, token.vector_norm, token.is_oov)

The True 76.91735 False
Dutch True 45.816505 False
word True 60.706367 False
for True 69.12914 False
peanut True 30.872149 False
butter True 45.09008 False
is True 110.41255 False
pindakaas False 0.0 True
did True 70.34003 False
you True 70.9396 False
know True 50.171894 False
that True 57.417362 False
This True 62.56213 False
is True 110.41255 False
a True 112.98545 False
typpo False 0.0 True


In [144]:
doc = nlp('Car')  # nlp() returns a Doc; for a single word its vector equals the token vector
print('The token: "{}" has the following vector (dimension: {})'.format(doc.text, len(doc.vector)))
doc.vector

The token: "Car" has the following vector (dimension: 300)


array([ 5.6920e+00, -3.2402e+00, -3.0820e+00, -6.7902e-01, -9.6719e-01,
       -2.9792e+00,  2.8736e+00, -4.9361e+00, -7.3910e-01,  4.0223e+00,
       -5.3932e+00, -1.3357e+00, -4.5541e+00, -1.4588e+00, -7.1353e+00,
        3.4909e+00,  2.6185e+00,  1.9497e+00, -4.6816e-01,  2.7521e+00,
       -1.5615e+00,  1.1734e+00,  9.5472e-01,  1.1160e-01,  6.9507e+00,
        1.2640e+00, -3.9840e+00, -6.4382e+00, -3.2300e+00, -3.0197e+00,
        3.2735e+00,  3.3488e+00, -6.4635e-03,  7.1386e+00, -2.0421e+00,
        6.4661e+00,  2.1496e-01,  2.7396e+00,  6.5596e-01, -5.7375e+00,
        4.3028e+00, -2.8573e-01, -1.1799e+00,  9.6603e+00, -7.9194e+00,
       -2.9292e+00, -1.6170e+00,  8.6791e+00,  1.3318e+00, -2.1795e+00,
       -1.5050e+00,  1.6407e+00,  1.7286e+00,  2.0606e+00,  5.6960e-01,
        4.6689e-02,  1.3460e+00, -1.9011e+00,  6.3613e+00,  2.2340e+00,
       -6.6167e+00, -2.1176e+00, -6.1828e+00, -1.2253e+00,  3.5796e+00,
       -5.7160e+00, -5.0363e+00, -5.2294e+00,  5.5554e+00,  3.18

### <span style="text-decoration: underline;">Word2Vec</span><a id='Word2Vec'></a> [(to top)](#toc)

*Note:* you might have to run `nltk.download('brown')` to install the NLTK corpus files

*Not:* NLTK korpus dosyalarını yüklemek için `nltk.download('brown')` komutunu çalıştırmanız gerekebilir.

In [149]:
import gensim
from nltk.corpus import brown

In [152]:
import nltk

nltk.download('brown')
from nltk.corpus import brown

sentences = brown.sents()




In [153]:
model = gensim.models.Word2Vec(sentences, min_count=1)

Save model

In [154]:
model.save('brown_model')

Load model

In [155]:
model = gensim.models.Word2Vec.load('brown_model')

Find the words most similar to 'mother':

In [156]:
print(model.wv.most_similar("mother"))

[('father', 0.9800074100494385), ('husband', 0.9679511189460754), ('wife', 0.9455371499061584), ('son', 0.9285951256752014), ('friend', 0.91536945104599), ('nickname', 0.913755476474762), ('voice', 0.9066001176834106), ('brother', 0.8932086229324341), ('addiction', 0.8855394721031189), ('patient', 0.8833369612693787)]


Find the odd one out:

Garip olanı bulun:

In [157]:
print(model.wv.doesnt_match("breakfast cereal dinner lunch".split()))

cereal


In [158]:
print(model.wv.doesnt_match("pizza pasta garden fries".split()))

pizza


Retrieve vector representation of the word "human"

"İnsan" kelimesinin vektör temsilini alın

In [159]:
model.wv['human']

array([-5.07787943e-01,  3.32118362e-01,  5.27413070e-01,  5.32106221e-01,
       -5.47175169e-01, -4.80501473e-01,  1.05670822e+00,  1.30387366e+00,
       -5.38607657e-01, -6.87516272e-01, -1.76177230e-02, -5.80767393e-01,
        5.06372988e-01, -1.08514535e+00,  1.85228974e-01, -6.29287004e-01,
        2.00535670e-01, -1.04284286e-01, -7.37063050e-01, -1.06254590e+00,
        4.54772562e-01,  1.31828815e-01,  7.16318429e-01,  2.13544697e-01,
       -2.55210102e-01,  8.32315758e-02,  1.23568615e-02,  2.72217356e-02,
       -8.63463581e-01,  6.19389899e-02,  4.20464873e-01, -4.16138262e-01,
        1.09194183e+00, -4.47616041e-01,  9.87596363e-02,  3.75444055e-01,
        1.20622888e-02, -1.00792325e+00, -2.46500000e-01,  1.27593264e-01,
        1.97680295e-01, -4.25558358e-01,  7.74740040e-01, -4.79565002e-03,
        6.24176204e-01,  2.57402778e-01, -4.71468091e-01, -1.62685752e-01,
       -3.23666602e-01,  2.41536587e-01,  2.27633834e-01, -4.73424256e-01,
       -6.79329515e-01, -

# <span style="text-decoration: underline;">Statistical models/İstatistiksel modeller</span><a id='stat_models'></a> [(to top)](#toc)

## <span style="text-decoration: underline;">"Traditional" machine learning/"Geleneksel" makine öğrenimi</span><a id='trad_ml'></a> [(to top)](#toc)

The library to use for machine learning is scikit-learn (["sklearn"](http://scikit-learn.org/stable/index.html)).

Makine öğrenimi için kullanılacak kitaplık scikit-learn'dür (["sklearn"](http://scikit-learn.org/stable/index.html)).

## <span>Supervised</span><a id='trad_ml_supervised'></a> [(to top)](#toc)

In [160]:
from sklearn.model_selection import cross_val_score, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
import joblib

In [161]:
import pandas as pd
import numpy as np

### Convert the data into a pandas dataframe (so that we can input it easier)
### Verileri bir panda veri çerçevesine dönüştürün (böylece daha kolay girebiliriz)

In [162]:
article_list = []
for author, value in spacy_text_clean.items():
    for article in value:
        article_list.append((author, ' '.join([x for x in article])))

In [163]:
article_df = pd.DataFrame(article_list, columns=['author', 'text'])

In [164]:
article_df.sample(5)

| | author | text |
|---|---|---|
| 1861 | PeterHumphrey | western countries geared quietly grant asylum ... |
| 37 | AaronPressman | commerce department showed unexpected degree f... |
| 1906 | PierreTran | sophie ex wall street lawyer shareholder lobby... |
| 2076 | SamuelPerry | hewlett packard jumped solidly microsoft windo... |
| 139 | AlexanderSmith | natwest bank admitted thursday multi million p... |


### Split the sample into a training and test sample

In [165]:
X_train, X_test, y_train, y_test = train_test_split(article_df.text, article_df.author, test_size=0.20, random_state=3561)

In [166]:
print(len(X_train), len(X_test))

2000 500


### Train and evaluate function/Eğit ve değerlendir işlevi

Simple function to train (i.e. fit) and evaluate the model

Modeli eğitmek (yani uygun) ve değerlendirmek için basit işlev

In [167]:
def train_and_evaluate(clf, X_train, X_test, y_train, y_test):
    
    clf.fit(X_train, y_train)
    
    print("Accuracy on training set:")
    print(clf.score(X_train, y_train))
    print("Accuracy on testing set:")
    print(clf.score(X_test, y_test))
    
    y_pred = clf.predict(X_test)
    
    print("Classification Report:")
    print(metrics.classification_report(y_test, y_pred))
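
The classification report lists precision, recall and F1 per class. As a sketch of what these columns mean, the function below computes them for one class from two label lists. The helper name `precision_recall_f1` is hypothetical and the labels are made-up toy data.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Precision, recall and F1 for one class, treating it as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # correctly predicted
    fp = sum(1 for t, p in pairs if t != label and p == label)  # predicted but wrong
    fn = sum(1 for t, p in pairs if t == label and p != label)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (made up for illustration)
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "A", "C"]
print(precision_recall_f1(y_true, y_pred, "A"))
```

The "support" column of the report is simply the number of true instances of each class in the evaluated sample.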

### <span>Naïve Bayes estimator/Naïve Bayes tahmincisi</span><a id='trad_ml_supervised_nb'></a> [(to top)](#toc)

In [168]:
from sklearn.naive_bayes import MultinomialNB

Define pipeline

Ardışık düzen tanımla

In [169]:
clf = Pipeline([
    ('vect', TfidfVectorizer(strip_accents='unicode',
                             lowercase = True,
                            max_features = 1500,
                            stop_words='english'
                            )),
        
    ('clf', MultinomialNB(alpha = 1,
                          fit_prior = True
                          )
    ),
])

Train and show evaluation stats

Değerlendirme istatistiklerini eğitin ve gösterin

In [170]:
train_and_evaluate(clf, X_train, X_test, y_train, y_test)

Accuracy on training set:
0.8485
Accuracy on testing set:
0.714
Classification Report:
                   precision    recall  f1-score   support

    AaronPressman       0.90      1.00      0.95         9
       AlanCrosby       0.58      0.92      0.71        12
   AlexanderSmith       0.86      0.60      0.71        10
  BenjaminKangLim       0.75      0.27      0.40        11
    BernardHickey       0.75      0.30      0.43        10
      BradDorfman       0.80      1.00      0.89         8
 DarrenSchuettler       0.58      0.78      0.67         9
      DavidLawder       1.00      0.60      0.75        10
    EdnaFernandes       1.00      0.67      0.80         9
      EricAuchard       0.86      0.67      0.75         9
   FumikoFujisaki       1.00      1.00      1.00        10
   GrahamEarnshaw       0.59      1.00      0.74        10
 HeatherScoffield       0.83      0.56      0.67         9
       JanLopatka       0.27      0.33      0.30         9
    JaneMacartney       0.3

Save results/Sonuçları kaydet

In [171]:
joblib.dump(clf, 'naive_bayes_results.pkl')

['naive_bayes_results.pkl']

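Predict a single example (note: this example is taken from the training set, so it is not a true out-of-sample prediction; use an element of `X_test` for that):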

In [172]:
example_y, example_X = y_train[33], X_train[33]

In [173]:
print('Actual author/gerçek yazar:', example_y)
print('Predicted author/Tahmin edilen yazar:', clf.predict([example_X])[0])

Actual author: AaronPressman
Predicted author: AaronPressman


### <span>Support Vector Machines (SVM)/Destek Vektör Makineleri (SVM)</span><a id='trad_ml_supervised_svm'></a> [(to top)](#toc)

In [174]:
from sklearn.svm import SVC

Define pipeline

Ardışık düzen tanımla

In [175]:
clf_svm = Pipeline([
    ('vect', TfidfVectorizer(strip_accents='unicode',
                             lowercase = True,
                            max_features = 1500,
                            stop_words='english'
                            )),
        
    ('clf', SVC(kernel='rbf' ,
                C=10, gamma=0.3)
    ),
])

*Note:* The SVC estimator is very sensitive to the hyperparameters!

*Not:* SVC tahmincisi, hiperparametrelere karşı çok hassastır!

Train and show evaluation stats

Değerlendirme istatistiklerini eğitin ve gösterin

In [176]:
train_and_evaluate(clf_svm, X_train, X_test, y_train, y_test)

Accuracy on training set:
0.997
Accuracy on testing set:
0.828
Classification Report:
                   precision    recall  f1-score   support

    AaronPressman       0.80      0.89      0.84         9
       AlanCrosby       0.79      0.92      0.85        12
   AlexanderSmith       1.00      0.70      0.82        10
  BenjaminKangLim       0.57      0.36      0.44        11
    BernardHickey       1.00      0.50      0.67        10
      BradDorfman       0.88      0.88      0.88         8
 DarrenSchuettler       1.00      0.89      0.94         9
      DavidLawder       1.00      0.60      0.75        10
    EdnaFernandes       0.75      1.00      0.86         9
      EricAuchard       0.89      0.89      0.89         9
   FumikoFujisaki       1.00      1.00      1.00        10
   GrahamEarnshaw       0.77      1.00      0.87        10
 HeatherScoffield       0.90      1.00      0.95         9
       JanLopatka       0.57      0.44      0.50         9
    JaneMacartney       0.33

Save the fitted model:

In [177]:
joblib.dump(clf_svm, 'svm_results.pkl')

['svm_results.pkl']
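Loading the model back is the mirror image of the call above. The following is a minimal sketch, assuming `joblib` and scikit-learn are installed; the toy corpus, the labels, and the names `toy_clf`, `restored_clf`, and `toy_svm.pkl` are invented so that the example runs on its own.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Tiny made-up corpus, only so that the sketch is self-contained.
texts = ["stocks fell sharply today", "the bank raised interest rates",
         "the team won the final match", "the striker scored two goals"]
labels = ["finance", "finance", "sports", "sports"]

toy_clf = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', SVC(kernel='rbf', C=10, gamma=0.3)),
])
toy_clf.fit(texts, labels)

# Write the fitted pipeline to disk and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_svm.pkl')
    joblib.dump(toy_clf, path)
    restored_clf = joblib.load(path)

# The restored pipeline should predict exactly like the original one.
new_texts = ["interest rates fell", "two goals in the match"]
original_pred = list(toy_clf.predict(new_texts))
restored_pred = list(restored_clf.predict(new_texts))
print(original_pred == restored_pred)
```

Because the file is a pickle, it should only be loaded from a trusted source, and ideally with the same scikit-learn version that produced it.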

Predict a single example (note that this one is drawn from the training set, so it is not strictly out of sample):

In [178]:
example_y, example_X = y_train[33], X_train[33]

In [179]:
print('Actual author:', example_y)
print('Predicted author:', clf_svm.predict([example_X])[0])

Actual author: AaronPressman
Predicted author: AaronPressman


## <span>Model Selection and Evaluation</span><a id='trad_ml_eval'></a> [(to top)](#toc)

Both the `TfidfVectorizer` and the `SVC` estimator take many hyperparameters.  

It can be difficult to figure out which values work best.

We can use `GridSearchCV` to help with this: it fits and cross-validates the pipeline for every combination in a grid of candidate values.

In [180]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import f1_score

First, we define the options that should be tried out:

In [181]:
clf_search = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', SVC())
])
parameters = {
    'vect__stop_words': ['english'],
    'vect__strip_accents': ['unicode'],
    'vect__max_features': [1500],
    'vect__ngram_range': [(1, 1), (2, 2)],
    'clf__gamma': [0.2, 0.3, 0.4],
    'clf__C': [8, 10, 12],
    'clf__kernel': ['rbf'],
}
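To get a feel for how much work this grid implies, the combinations can be enumerated by hand. This is only a plain-Python sketch of the idea, not how scikit-learn builds its candidates internally; `n_folds = 5` is an assumption based on the cross-validation default in recent scikit-learn releases.

```python
from itertools import product

# Same grid as in the cell above, repeated so the sketch is self-contained.
parameters = {
    'vect__stop_words': ['english'],
    'vect__strip_accents': ['unicode'],
    'vect__max_features': [1500],
    'vect__ngram_range': [(1, 1), (2, 2)],
    'clf__gamma': [0.2, 0.3, 0.4],
    'clf__C': [8, 10, 12],
    'clf__kernel': ['rbf'],
}

# Every candidate is one choice per parameter (Cartesian product).
names = list(parameters)
candidates = [dict(zip(names, values))
              for values in product(*parameters.values())]

n_folds = 5  # assumption: default number of CV folds
print(len(candidates), "candidates,", len(candidates) * n_folds, "fits")
# prints: 18 candidates, 90 fits
```

Only `ngram_range`, `gamma`, and `C` have more than one option, so the grid has 2 x 3 x 3 = 18 candidates. Adding a single extra value to any list multiplies the number of fits, which is why grids are best kept small.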

Run the grid search:

In [182]:
grid = GridSearchCV(clf_search, 
                    param_grid=parameters, 
                    scoring=make_scorer(f1_score, average='micro'), 
                    n_jobs=-1
                   )
grid.fit(X_train, y_train)    

*Note:* if you are on a powerful (preferably Unix) system, you can set `n_jobs` to the number of available threads to speed up the calculation; `n_jobs=-1`, as used above, uses all available processors.
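A side note on the scoring choice: when every document has exactly one author, micro-averaged F1 is the same number as plain accuracy, because each wrong prediction counts once as a false positive and once as a false negative. The sketch below checks this in plain Python; the helpers `micro_f1` and `accuracy` and the toy labels are made up for illustration and are not scikit-learn functions.

```python
from collections import Counter


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1 once."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true, pred in zip(y_true, y_pred):
        if true == pred:
            tp[true] += 1
        else:
            fp[pred] += 1  # the predicted class gains a false positive
            fn[true] += 1  # the true class gains a false negative
    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())
    return 2 * total_tp / (2 * total_tp + total_fp + total_fn)


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels: 4 of 6 predictions are correct.
toy_true = ["A", "A", "B", "B", "C", "C"]
toy_pred = ["A", "B", "B", "B", "C", "A"]

print(round(micro_f1(toy_true, toy_pred), 4), round(accuracy(toy_true, toy_pred), 4))
```

If per-author performance matters more than the overall hit rate, `average='macro'` weights every author equally instead.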

In [183]:
print("The best parameters are %s with a score of %0.2f" % (grid.best_params_, grid.best_score_))
y_true, y_pred = y_test, grid.predict(X_test)
print(metrics.classification_report(y_true, y_pred))

The best parameters are {'clf__C': 10, 'clf__gamma': 0.4, 'clf__kernel': 'rbf', 'vect__max_features': 1500, 'vect__ngram_range': (1, 1), 'vect__stop_words': 'english', 'vect__strip_accents': 'unicode'} with a score of 0.80
                   precision    recall  f1-score   support

    AaronPressman       0.80      0.89      0.84         9
       AlanCrosby       0.79      0.92      0.85        12
   AlexanderSmith       1.00      0.70      0.82        10
  BenjaminKangLim       0.57      0.36      0.44        11
    BernardHickey       1.00      0.50      0.67        10
      BradDorfman       0.88      0.88      0.88         8
 DarrenSchuettler       1.00      0.89      0.94         9
      DavidLawder       1.00      0.60      0.75        10
    EdnaFernandes       0.80      0.89      0.84         9
      EricAuchard       0.89      0.89      0.89         9
   FumikoFujisaki       1.00      1.00      1.00        10
   GrahamEarnshaw       0.77      1.00      0.87        10
 HeatherS