# Project: authorship attribution

## by: Dekel Mor, Matan Ramati


In this project, our goal is to attribute a text to the most likely author from a given set of candidate authors.
To do that, we will pursue the following research goals:
* identify useful features in texts
* find efficient ways to extract those features
* explore algorithms to find the most suitable and accurate one for the task

An authorship attribution tool can be useful and powerful. Facial, voice, and fingerprint recognition already exist, but none of them helps when all you have is a piece of text.
Examples of interesting uses: identifying impersonators, recognizing criminals from textual evidence, protecting intellectual property, etc.

The data will be an array in which each entry contains an array of the text's words and a label (the author's name).
The data can come from Facebook posts, letters, opinion pieces, and other texts that are not written in a strictly formal style.
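The layout described above can be sketched as follows. This is only an illustration: the sample texts, the `author_a`/`author_b` labels, and the helper `split_tokens_and_labels` are hypothetical and do not come from the dataset.

```python
from typing import List, Tuple

# Hypothetical layout: each entry pairs a list of word tokens with an author label.
Sample = Tuple[List[str], str]

data: List[Sample] = [
    (["thanks", "to", "the", "toolbar"], "author_a"),
    (["testing", "!", "!", "!"], "author_b"),
    (["korea", "is", "a", "country"], "author_a"),
]

def split_tokens_and_labels(samples: List[Sample]):
    """Separate the token lists from their author labels."""
    tokens = [sample_tokens for sample_tokens, _ in samples]
    labels = [label for _, label in samples]
    return tokens, labels

tokens, labels = split_tokens_and_labels(data)
```

Keeping tokens and labels in parallel lists makes it easy to hand them to a vectorizer and a classifier later on.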





In [18]:
import nltk
import pandas as pd
import numpy as np
import random 
from nltk import NaiveBayesClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [None]:
dataset = pd.read_csv('blogtext.csv', delimiter=',')

In [24]:
sample = dataset[:500]
author_text = sample[['id','text']]
author_text

Unnamed: 0,id,text
0,2059027,"Info has been found (+/- 100 pages,..."
1,2059027,These are the team members: Drewe...
2,2059027,In het kader van kernfusie op aarde...
3,2059027,testing!!! testing!!!
4,3581210,Thanks to Yahoo!'s Toolbar I can ...
...,...,...
495,3022585,This is a quote from a graduate of my c...
496,3022585,"Yesterday, Bunny Day, I was driving out..."
497,3022585,This past week was my Spring Break from...
498,3022585,And by the devil I mean Uncle Sam. I n...


We select a specific author in order to view all of their blog posts:

In [6]:
spec_author_texts = author_text.loc[author_text['id'] == 3581210, 'text']
spec_author_texts

4                  Thanks to Yahoo!'s Toolbar I can ...
5                  I had an interesting conversation...
6                  Somehow Coca-Cola has a way of su...
7                  If anything, Korea is a country o...
8                  Take a read of this news article ...
                            ...                        
69                 Korea's pretty funny sometimes.  ...
70                 Ya, I'm off to Canada/Vancouver a...
71                 Ah, finally...someone else I know...
72                 I think if I'm going to claim 여의도...
73                 Being gay in Korea is like being ...
Name: text, Length: 70, dtype: object

We pick one post and display its contents:

In [7]:
spec_author_texts[5]

"             I had an interesting conversation with my Dad this morning.  We were talking about where Koreans put their money.  Invariably, they have a lot of real estate and cash.  (Cash would include short term investments under one year as well as savings accounts.)  The reason?  Real estate makes money here.  A lot of money.  I've seen surveys of Seoul real estate rising about 10-15% PER YEAR for long stretches, even after taking into account the 1997 Crisis (referred to as the IMF crisis here, although it was the IMF that bailed Korea out).  Compare that to Korean corporate bonds which fell 90-99% in 1997 and only modestly recovered, and a local stock market (represented by KOSPI, or their version of the Dow Jones Index) that has not gone appreciably above its 1980s high of 1,000 points (it is now about 800 points, see  urlLink link ) and you can see why real estate makes sense here.  But back to the conversation...I noted that here a 'real big' or 'elite' real estate investor ha

Breaking the text into word tokens:

In [8]:
from nltk.tokenize import TweetTokenizer
tweet_tok = TweetTokenizer(preserve_case=False)
more_tok = tweet_tok.tokenize(spec_author_texts[5])
print(more_tok)

['i', 'had', 'an', 'interesting', 'conversation', 'with', 'my', 'dad', 'this', 'morning', '.', 'we', 'were', 'talking', 'about', 'where', 'koreans', 'put', 'their', 'money', '.', 'invariably', ',', 'they', 'have', 'a', 'lot', 'of', 'real', 'estate', 'and', 'cash', '.', '(', 'cash', 'would', 'include', 'short', 'term', 'investments', 'under', 'one', 'year', 'as', 'well', 'as', 'savings', 'accounts', '.', ')', 'the', 'reason', '?', 'real', 'estate', 'makes', 'money', 'here', '.', 'a', 'lot', 'of', 'money', '.', "i've", 'seen', 'surveys', 'of', 'seoul', 'real', 'estate', 'rising', 'about', '10-15', '%', 'per', 'year', 'for', 'long', 'stretches', ',', 'even', 'after', 'taking', 'into', 'account', 'the', '1997', 'crisis', '(', 'referred', 'to', 'as', 'the', 'imf', 'crisis', 'here', ',', 'although', 'it', 'was', 'the', 'imf', 'that', 'bailed', 'korea', 'out', ')', '.', 'compare', 'that', 'to', 'korean', 'corporate', 'bonds', 'which', 'fell', '90-99', '%', 'in', '1997', 'and', 'only', 'modest

Filtering out stopwords ("i", "and", "it", etc.):

In [14]:
from nltk.corpus import stopwords 

stop_words = set(stopwords.words('english')) 
filtered_sentence = [w for w in more_tok if w not in stop_words]
print(filtered_sentence)

['interesting', 'conversation', 'dad', 'morning', '.', 'talking', 'koreans', 'put', 'money', '.', 'invariably', ',', 'lot', 'real', 'estate', 'cash', '.', '(', 'cash', 'would', 'include', 'short', 'term', 'investments', 'one', 'year', 'well', 'savings', 'accounts', '.', ')', 'reason', '?', 'real', 'estate', 'makes', 'money', '.', 'lot', 'money', '.', "i've", 'seen', 'surveys', 'seoul', 'real', 'estate', 'rising', '10-15', '%', 'per', 'year', 'long', 'stretches', ',', 'even', 'taking', 'account', '1997', 'crisis', '(', 'referred', 'imf', 'crisis', ',', 'although', 'imf', 'bailed', 'korea', ')', '.', 'compare', 'korean', 'corporate', 'bonds', 'fell', '90-99', '%', '1997', 'modestly', 'recovered', ',', 'local', 'stock', 'market', '(', 'represented', 'kospi', ',', 'version', 'dow', 'jones', 'index', ')', 'gone', 'appreciably', '1980s', 'high', '1,000', 'points', '(', '800', 'points', ',', 'see', 'urllink', 'link', ')', 'see', 'real', 'estate', 'makes', 'sense', '.', 'back', 'conversation',

Filtering out one-character tokens and finding the most frequent words:

In [17]:
from nltk import FreqDist

# Keep only tokens longer than one character
oftenwords1 = [tok for tok in filtered_sentence if len(tok) > 1]
FreqDist(oftenwords1)

FreqDist({'real': 8, 'usd': 8, 'estate': 7, 'year': 6, 'rent': 5, 'money': 4, 'one': 4, 'korea': 4, 'see': 4, 'urllink': 4, ...})

Building a data structure that contains the tokenized posts and the author id:

In [11]:
# Each entry is a one-element list holding a (tokens, author_id) tuple
posts = []
for text, author_id in zip(author_text['text'], author_text['id']):
    posts.append([(tweet_tok.tokenize(text), author_id)])
posts[5]

[(['i',
   'had',
   'an',
   'interesting',
   'conversation',
   'with',
   'my',
   'dad',
   'this',
   'morning',
   '.',
   'we',
   'were',
   'talking',
   'about',
   'where',
   'koreans',
   'put',
   'their',
   'money',
   '.',
   'invariably',
   ',',
   'they',
   'have',
   'a',
   'lot',
   'of',
   'real',
   'estate',
   'and',
   'cash',
   '.',
   '(',
   'cash',
   'would',
   'include',
   'short',
   'term',
   'investments',
   'under',
   'one',
   'year',
   'as',
   'well',
   'as',
   'savings',
   'accounts',
   '.',
   ')',
   'the',
   'reason',
   '?',
   'real',
   'estate',
   'makes',
   'money',
   'here',
   '.',
   'a',
   'lot',
   'of',
   'money',
   '.',
   "i've",
   'seen',
   'surveys',
   'of',
   'seoul',
   'real',
   'estate',
   'rising',
   'about',
   '10-15',
   '%',
   'per',
   'year',
   'for',
   'long',
   'stretches',
   ',',
   'even',
   'after',
   'taking',
   'into',
   'account',
   'the',
   '1997',
   'crisis',
   '(',

In [30]:
import random
# Shuffle in place so that posts by the same author are no longer adjacent
random.shuffle(posts)

As part of the research for the project, we will try to find more useful features and measure how much each contributes to author prediction.
Candidates include frequently used words, frequently used syntactic constructions, etc.
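A minimal sketch of what such features could look like, using only the standard library. The feature set, the tiny `FUNCTION_WORDS` list, and the function `style_features` are assumptions made for illustration. Punctuation rate and function-word rates stand in here for syntactic features, which would really need a part-of-speech tagger.

```python
import string
from collections import Counter

# Hypothetical, very short function-word list; a real study would use a longer one.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "it")

def style_features(tokens):
    """Return simple stylometric features for one tokenized post."""
    # Tokens that contain at least one letter or digit count as words.
    words = [t for t in tokens if any(ch.isalnum() for ch in t)]
    # Tokens made only of punctuation characters count as punctuation.
    punct = [t for t in tokens if t and all(ch in string.punctuation for ch in t)]
    n_tokens = max(len(tokens), 1)  # avoid division by zero on empty posts
    n_words = max(len(words), 1)
    counts = Counter(words)
    features = {
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "punct_rate": len(punct) / n_tokens,
        "type_token_ratio": len(counts) / n_words,
    }
    # Relative frequency of each function word.
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = counts[fw] / n_words
    return features

example = style_features(["i", "had", "a", "lot", "of", "cash", ".", "a", "lot", "."])
```

Each post becomes a fixed-length dictionary of numbers, so the contribution of an individual feature can be measured by training with and without it.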

The preprocessing is still basic: for now we just give the model the raw text and the author label and let it classify.
We need to improve the input (perhaps as above: tokenized words without stopwords and punctuation marks).
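One possible shape for that improved input step is sketched below. It is a standard-library approximation, not the notebook's pipeline: the regular expression `TOKEN_RE` and the short `STOP_WORDS` set are hypothetical stand-ins for `TweetTokenizer` and `stopwords.words('english')`.

```python
import re

# Hypothetical mini stop list for illustration only; the notebook uses
# nltk.corpus.stopwords.words('english') instead.
STOP_WORDS = {"i", "a", "an", "and", "it", "the", "of", "to",
              "my", "with", "this", "had"}

# Lowercase alphanumeric runs, optionally followed by an apostrophe suffix
# (e.g. "i've"); punctuation is never matched, so it is dropped implicitly.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")

def preprocess(text):
    """Lowercase, tokenize, and drop stopwords and one-character tokens."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]

cleaned = preprocess("I had an interesting conversation with my Dad this morning.")
# cleaned == ['interesting', 'conversation', 'dad', 'morning']
```

The cleaned token lists could then be joined back into strings and passed to the vectorizer in place of the raw text.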


First model for classifying authors by text:


In [19]:
# Features
Xfeatures = author_text['text']
ylabels = author_text['id']

In [20]:
# Vectorization
cv = CountVectorizer()
X = cv.fit_transform(Xfeatures)

In [27]:
# Split dataset
x_train,x_test,y_train,y_test = train_test_split(X,ylabels,test_size=0.33,random_state=50)
x_train.shape

(335, 13501)

In [28]:
# Building Our Model
clf = MultinomialNB()
clf.fit(x_train,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [29]:
print("Accuracy of Model :",clf.score(x_test,y_test))

Accuracy of Model : 0.5515151515151515
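To put an accuracy figure like the one above in context, it can help to compare it with a trivial baseline that always predicts the most common author in the training set. The sketch below assumes plain lists of labels. The function `majority_baseline_accuracy` and the toy labels are hypothetical and were not run on the blog data.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in test_labels if y == majority_label)
    return hits / len(test_labels)

# Toy labels: author 1 is the majority class in training.
baseline = majority_baseline_accuracy([1, 1, 1, 2, 3], [1, 2, 1, 3])
# baseline == 0.5
```

With the notebook's variables, the call would presumably be `majority_baseline_accuracy(list(y_train), list(y_test))`.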



As we can see, the accuracy is not very good. We will try to improve it by using more features and by using the words that appear frequently in the text instead of the raw blog text.
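A minimal sketch of restricting the features to frequent words, assuming posts are already tokenized. The helpers `build_vocabulary` and `vectorize` and the toy posts are hypothetical. In practice the vocabulary would be built from the training posts only.

```python
from collections import Counter

def build_vocabulary(tokenized_posts, k):
    """Return the k most frequent tokens across all posts."""
    counts = Counter(tok for post in tokenized_posts for tok in post)
    return [word for word, _ in counts.most_common(k)]

def vectorize(post, vocabulary):
    """Count how often each vocabulary word occurs in one post."""
    counts = Counter(post)
    return [counts[word] for word in vocabulary]

toy_posts = [
    ["real", "estate", "money", "real"],
    ["money", "rent", "real"],
]
vocab = build_vocabulary(toy_posts, k=2)
vectors = [vectorize(post, vocab) for post in toy_posts]
# vocab == ['real', 'money'], vectors == [[2, 1], [1, 1]]
```

The resulting count vectors have one column per frequent word, so they can be fed to a classifier in the same way as the `CountVectorizer` output.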


Once we achieve our main goal of text authorship attribution and our algorithm works as expected, our next goal is to adapt the algorithm so that it runs on Hebrew texts as well.
We will try to achieve this using existing open-source tools for Hebrew NLP (entity recognition, tokenization, word tagging, etc.), mainly from the website of MILA (Knowledge Center for Processing Hebrew): https://yeda.cs.technion.ac.il/eng/index.html
