Before answering this question, see where you are in the data-processing workflow:

![Feature Engineering](assets/feature_engineering.png)

As the diagram above shows, in this checkpoint you'll start to explore how to convert text data into numerical form. This is the feature-engineering step that you need to complete before feeding your data into any machine-learning algorithm. Converting text into numerical form is often called *language modeling* in NLP jargon, and it's one of the most active research areas in NLP. The name comes from semantics: a good numerical representation of a text should capture the meaning of its words and the relationships among them. You'll learn about semantics in the next checkpoint. For now, you'll focus on a simple way to represent words in numerical form.

To put the feature-generation techniques that you'll learn here to work and to see how they perform on a machine-learning task, you'll feed your new numerical features into several machine-learning algorithms. You'll use them to classify text, which will expose the pros and cons of each feature-engineering method. Hence, your first hands-on NLP application in this module will be *text classification*.

Continue on to get started!

# Bag-of-words

The first feature-generation approach that you'll learn about here is called *bag-of-words* (BoW). BoW is quite simple: your goal is to create a feature matrix such that the rows are observations, and each column is a unique word in your vocabulary. You fill in this matrix by counting how many times each word appears in each observation. You then use those counts as features. 
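
To make the counting idea concrete, here is a minimal sketch that builds a BoW matrix by hand using only the standard library. The three toy documents are hypothetical and are not taken from the novels; whitespace tokenization is an assumption made to keep the sketch short.

```python
from collections import Counter

# Three toy "documents" (hypothetical examples, not from the novels)
docs = ["the cat sat", "the cat saw the dog", "a dog sat"]

# Tokenize on whitespace and count the words in each document
counts = [Counter(doc.split()) for doc in docs]

# The vocabulary is the sorted set of all words seen in any document
vocab = sorted(set(word for c in counts for word in c))

# One row per document, one column per vocabulary word
bow_matrix = [[c[word] for word in vocab] for c in counts]

print(vocab)          # ['a', 'cat', 'dog', 'sat', 'saw', 'the']
for row in bow_matrix:
    print(row)
# [0, 1, 0, 1, 0, 1]
# [0, 1, 1, 0, 1, 2]
# [1, 0, 1, 1, 0, 0]
```

Each row is one observation and each column is one vocabulary word, which is the same layout that the feature matrix has later in this checkpoint.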

As mentioned, BoW is simple and very easy to implement using libraries like scikit-learn. In scikit-learn, the class that generates BoW features is called `CountVectorizer`, as you'll see shortly. However, before moving on to implement the BoW approach, you'll need to do some data cleaning. 

Begin by importing the libraries that you'll be using:

In [1]:
import numpy as np
import pandas as pd
import sklearn
import spacy
import re
from nltk.corpus import gutenberg
import nltk
import warnings
warnings.filterwarnings("ignore")

# nltk.download('gutenberg')
# !python -m spacy download en_core_web_sm  # spaCy 3+; older 2.x releases used the shortcut 'en'

Now, write a helper function called `text_cleaner` for cleaning the text. Specifically, remove some punctuation marks and numbers from the text:

In [2]:
# Utility function for standard text cleaning
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation that spaCy doesn't
    # recognize: the double dash '--'. Better get rid of it now!
    text = re.sub(r'--', ' ', text)
    # Remove anything enclosed in square brackets
    text = re.sub(r"\[.*?\]", "", text)
    # Remove numbers (integers and decimals)
    text = re.sub(r"(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)\b", " ", text)
    # Collapse runs of whitespace (including newlines) into single spaces
    text = ' '.join(text.split())
    return text
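
To see what each substitution in `text_cleaner` does, the sketch below applies the three patterns one at a time. The sample strings are hypothetical and were chosen so that each one triggers a single pattern; the `squeeze` helper is introduced here only for illustration.

```python
import re

# The three patterns used by text_cleaner, applied one at a time
DOUBLE_DASH = re.compile(r"--")
BRACKETED = re.compile(r"[\[].*?[\]]")
NUMBER = re.compile(r"(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)\b")

def squeeze(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())

# Hypothetical sample strings, one per pattern
dash_demo = squeeze(DOUBLE_DASH.sub(" ", "tired--very tired"))
bracket_demo = squeeze(BRACKETED.sub("", "Alice [Illustration] ran"))
number_demo = squeeze(NUMBER.sub(" ", "she ate 42 cakes"))

print(dash_demo)     # tired very tired
print(bracket_demo)  # Alice ran
print(number_demo)   # she ate cakes
```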

Next, load Jane Austen's *Persuasion* and Lewis Carroll's *Alice's Adventures in Wonderland* from NLTK's Gutenberg module. In this checkpoint, you'll be working on these two texts, and your ultimate goal will be to distinguish the authors from their sentences. Hence, your unit of observation (your *documents*) will be the sentences of these novels.

After you load the novels, do some data cleaning. First, remove the chapter indicators from the novels. Then apply the `text_cleaner` function from above to clean up some punctuation marks and the numbers:

In [24]:
# Load and clean the data
persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

# The chapter indicator is idiosyncratic
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)
    
alice = text_cleaner(alice)
persuasion = text_cleaner(persuasion)

The cleaned texts are stored in two variables called `alice` and `persuasion`. Note that you haven't split the texts into sentences yet; you'll do that using spaCy. To prepare, load spaCy's English model and use it to parse both the `alice` and `persuasion` texts:

In [25]:
# Parse the cleaned novels. This can take some time.
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut was removed in spaCy 3
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

In [45]:
alice_doc



You can split your texts into sentences now. Because you've already parsed your documents with spaCy, the sentence boundaries are already available: each parsed document exposes its sentences through the `.sents` attribute. The following code iterates over `.sents` using list comprehensions, labels each sentence with its author, and keeps the first 1,000 sentences from each novel.

In [28]:
# Group into sentences
alice_sents = [[sent, "Carroll"] for sent in alice_doc.sents]
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]

alice_sents = alice_sents[0:1000]
persuasion_sents = persuasion_sents[0:1000]

# Combine the sentences from the two novels into one DataFrame
sentences = pd.DataFrame(alice_sents + persuasion_sents, columns = ["text", "author"])
sentences.head()

| | text | author |
|---|---|---|
| 0 | (Alice, was, beginning, to, get, very, tired, ... | Carroll |
| 1 | (So, she, was, considering, in, her, own, mind... | Carroll |
| 2 | (There, was, nothing, so, VERY, remarkable, in... | Carroll |
| 3 | (Oh, dear, !) | Carroll |
| 4 | (I, shall, be, late, !, ') | Carroll |


In [29]:
len(sentences)

2000

As a result, your dataset consists of two columns. The first column has the sentences, and the second column has the authors. Before jumping into BoW, you need to remove stop words and punctuation marks, and then convert your tokens to lemmas or stems. In this example, you'll lemmatize your tokens. Again, you'll make use of the attributes of the documents that spaCy parsed.

In [30]:
# Get rid of stop words and punctuation,
# and lemmatize the tokens
for i, sentence in enumerate(sentences["text"]):
    sentences.loc[i, "text"] = " ".join(
        [token.lemma_ for token in sentence if not token.is_punct and not token.is_stop])
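
The filtering step above relies on spaCy's token attributes. As a rough illustration of the same idea without spaCy, the sketch below drops punctuation and stop words using plain Python. The stop-word list is a tiny hypothetical one (spaCy's built-in list is much longer), and the sketch does not lemmatize, so words such as "beginning" keep their original form.

```python
import string

# A tiny, hypothetical stop-word list; spaCy's built-in list is much longer
STOP_WORDS = {"the", "a", "to", "of", "was", "very", "her", "by", "on", "and", "get"}

def filter_tokens(sentence):
    """Lowercase, strip punctuation, and drop stop words (no lemmatization)."""
    tokens = []
    for raw in sentence.split():
        word = raw.strip(string.punctuation).lower()
        if word and word not in STOP_WORDS:
            tokens.append(word)
    return " ".join(tokens)

cleaned = filter_tokens(
    "Alice was beginning to get very tired of sitting by her sister on the bank."
)
print(cleaned)  # alice beginning tired sitting sister bank
```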

Now you can start converting the text in the first column of your dataset into a numerical form. As mentioned before, you'll use the BoW approach. For this purpose, use `CountVectorizer` from scikit-learn, as follows:

In [31]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word')
X = vectorizer.fit_transform(sentences["text"])
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
sentences = pd.concat([bow_df, sentences[["text", "author"]]], axis=1)

And that's all! Now, check out your new dataset:

In [32]:
sentences.head()

Unnamed: 0,29th,abbreviation,abide,ability,able,abode,abominate,abroad,absence,absent,absolutely,absurd,abuse,accept,acceptable,acceptance,accession,accidentally,accommodation,accompany,accomplishment,accordingly,account,accurately,accuse,accustomary,accustomed,acknowledge,acquaint,acquaintance,acquainted,acquire,acre,act,action,active,actual,actually,actuate,acute,...,word,work,world,worm,worry,worse,worsting,worth,wound,wow,wrap,wreck,wretched,wretchedly,wretchedness,wriggle,wrinkle,wrist,write,writing,wrong,yard,yawn,ye,year,yearly,yelp,yeoman,yer,yes,yesterday,young,youth,youthful,zeal,zealand,zealous,zigzag,text,author
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Alice begin tired sit sister bank have twice p...,Carroll
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,consider mind hot day feel sleepy stupid pleas...,Carroll
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,remarkable Alice think way hear Rabbit oh dear,Carroll
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,oh dear,Carroll
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,shall late,Carroll


As you can see, you now have a dataset that matches the format that you're used to in this program. It's in tabular form: observations are the rows and features are the columns. More importantly, you converted text into a numerical form, so you can apply machine-learning algorithms using these as input. This enables you to move to the modeling phase, as indicated below.

![Modeling](assets/modeling.png)

## BoW in action

Now, give the bag-of-words features a whirl by trying some machine-learning algorithms:

In [33]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

Y = sentences['author']
X = np.array(sentences.drop(['text','author'], axis=1))

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)

# Models
lr = LogisticRegression()
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train, y_train)
rfc.fit(X_train, y_train)
gbc.fit(X_train, y_train)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

print("----------------------Gradient Boosting Scores----------------------")
print('Training set score:', gbc.score(X_train, y_train))
print('\nTest set score:', gbc.score(X_test, y_test))

----------------------Logistic Regression Scores----------------------
Training set score: 0.9541666666666667

Test set score: 0.8725
----------------------Random Forest Scores----------------------
Training set score: 0.9741666666666666

Test set score: 0.835
----------------------Gradient Boosting Scores----------------------
Training set score: 0.86

Test set score: 0.80875


It looks like logistic regression and random forest overfit: their training scores are well above their test scores. Overfitting is a known problem when using the bag-of-words approach because it involves throwing a massive number of features at a model. Some of those features (in this case, word frequencies) will capture noise in the training set. Because overfitting is also a known problem with random forests, the divergence between that model's training score and test score is expected. On the other hand, the training and test scores from gradient boosting are close to each other.
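
One simple way to quantify the overfitting is to subtract each test score from the matching training score. The sketch below does that with the scores copied from the output above; the gap is only a rough indicator, not a formal test.

```python
# Accuracy scores copied from the output above: (training, test)
scores = {
    "Logistic Regression": (0.9541666666666667, 0.8725),
    "Random Forest": (0.9741666666666666, 0.835),
    "Gradient Boosting": (0.86, 0.80875),
}

# Gap between training and test accuracy for each model
gaps = {name: train - test for name, (train, test) in scores.items()}

# Print the models from the largest gap to the smallest
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: gap = {gap:.3f}")
# Random Forest: gap = 0.139
# Logistic Regression: gap = 0.082
# Gradient Boosting: gap = 0.051
```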

# N-grams: words in context

Consider the word `vain` in these two sentences:

    “She labored in vain; the rock would not move.” 

    “She was so vain, her bathroom mirror was covered in lip prints.”

In both sentences, `vain` is an adjective. In the first sentence, it signals a lack of success. In the second sentence, the same word means vanity. Since the two usages can't be distinguished by their part of speech, how can you tell them apart?

*N-grams* incorporate context information by creating features made up of a series of consecutive words. The *n* refers to the number of words included in the series. For example, the 2-gram representation of the first sentence would be as follows:

    (She labored), (labored in), (in vain), (vain the), (the rock), (rock would), (would not), (not move).

The 3-gram representation of the second sentence would be as follows:

    (She was so), (was so vain), (so vain her), (vain her bathroom), (her bathroom mirror), (bathroom mirror was), (mirror was covered), (was covered in), (covered in lip), (in lip prints).

Each of these word sequences could then operate as its own feature. N-grams can be used to create term-document matrices (though they would now be n-gram-document matrices), or they can be used in topic modeling. In addition, n-grams are useful for text prediction; they can be used to determine what words are most likely to follow in a sentence, phrase, or search query.

For a sentence with *X* words, there will be $X-(N-1)$ n-grams. Two-gram phrases are also called *bigrams*, three-gram phrases are called *trigrams*, and so on.
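
The counting rule can be checked with a short sketch. The `ngrams` helper below is a hypothetical illustration written for this example, not a library function; it slides a window of width *n* over the tokens of the two `vain` sentences (punctuation removed by hand).

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

first = "She labored in vain the rock would not move".split()
second = "She was so vain her bathroom mirror was covered in lip prints".split()

bigrams = ngrams(first, 2)
trigrams = ngrams(second, 3)

print(len(first), len(bigrams))    # 9 8   -> 9 - (2 - 1) = 8
print(len(second), len(trigrams))  # 12 10 -> 12 - (3 - 1) = 10
print(bigrams[2])                  # ('in', 'vain')
```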

## Why use single-word models?

Given the benefits of incorporating word context for distinguishing between different meanings of a word, why would any NLP practitioner worth their salt ever use simple word features? Well, models based on single words have several advantages:

* N-gram models are considerably more sparse than single-word models. The two `vain` sentences above share three words (`she`, `in`, `vain`) but zero 2-grams. Sparseness does mean that an n-gram model can be stored in a more memory-efficient way. For example, imagine a dict that lists only the n-grams that are present in each sentence, rather than a set of columns with `1` if an n-gram is present and `0` otherwise. But it also means that a larger corpus may be needed to detect any shared patterns across documents. In other words, n-gram models may need more documents before they start to give good results.

* Single-word models are straightforward to implement, while models incorporating n-grams are more sensitive to fine distinctions of meaning. Which to choose depends on the goals of the NLP project and the tradeoffs in time and performance for the specific corpus that you are modeling.
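
The sparseness point can be illustrated with the two `vain` sentences. The sketch below is a minimal example under simple assumptions (lowercasing and stripping punctuation by hand; the `tokenize` and `bigrams` helpers are hypothetical). It compares the words and 2-grams that the sentences share, and it stores the 2-grams of the first sentence in a dict-like `Counter` that lists only the 2-grams that occur.

```python
import string
from collections import Counter

def tokenize(sentence):
    """Lowercase each word and strip surrounding punctuation."""
    words = (w.strip(string.punctuation).lower() for w in sentence.split())
    return [w for w in words if w]

def bigrams(tokens):
    """Return consecutive word pairs as tuples."""
    return list(zip(tokens, tokens[1:]))

s1 = tokenize("She labored in vain; the rock would not move.")
s2 = tokenize("She was so vain, her bathroom mirror was covered in lip prints.")

shared_words = set(s1) & set(s2)
shared_bigrams = set(bigrams(s1)) & set(bigrams(s2))

# Sparse storage: keep only the 2-grams that actually occur in the sentence
sparse_s1 = Counter(bigrams(s1))

print(sorted(shared_words))  # ['in', 'she', 'vain']
print(len(shared_bigrams))   # 0
print(len(sparse_s1))        # 8
```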

## Example of 2-grams

Implementing n-grams is quite straightforward using scikit-learn's `CountVectorizer`. All you need to do is pass a `(min_n, max_n)` tuple to its `ngram_range` parameter. As the code below demonstrates, setting `ngram_range=(2,2)` makes the vectorizer produce only 2-gram features. If you were to pass `ngram_range=(1,2)` instead, the vectorizer would produce both 1-gram and 2-gram features. But don't do that just yet; that task will be saved for this checkpoint's assignment.

Now, generate your 2-grams and see what they look like:

In [34]:
# Use 2-grams
vectorizer = CountVectorizer(analyzer='word', ngram_range=(2,2))
X = vectorizer.fit_transform(sentences["text"])
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
sentences = pd.concat([bow_df, sentences[["text", "author"]]], axis=1)
sentences.head()

Unnamed: 0,29th september,abbreviation living,abide figure,ability difficulty,able convince,able devise,able eat,able far,able leave,able persuade,able ring,able set,able watch,abroad intention,abroad supposition,abroad talent,abroad work,absence disinterested,absence home,absence young,absent beginning,absolutely hopeless,absurd carry,absurd look,absurd resume,absurd suspicion,absurd use,abuse want,acceptable miss,acceptance elegant,accession frightened,accidentally hear,accommodation arrangement,accommodation board,accommodation man,accompany husband,accomplishment home,accomplishment like,accordingly go,accordingly removal,...,young aunt,young certainly,young child,young couple,young crab,young fellow,young friend,young gentle,young girl,young hayters,young know,young lady,young man,young miss,young people,young person,young sister,young squire,young woman,youth bloom,youth early,youth father,youth fine,youth hardly,youth jaw,youth kill,youth learn,youth like,youth mention,youth say,youth spring,youth vigour,youthful infatuation,zeal dwell,zeal sport,zealand australia,zealous subject,zigzag go,text,author
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Alice begin tired sit sister bank have twice p...,Carroll
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,consider mind hot day feel sleepy stupid pleas...,Carroll
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,remarkable Alice think way hear Rabbit oh dear,Carroll
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,oh dear,Carroll
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,shall late,Carroll


As you can see, your new features are 2-grams. Next, build the same machine-learning models that you built before for the 1-gram case, but this time, use 2-grams as your features:

In [35]:
Y = sentences['author']
X = np.array(sentences.drop(['text','author'], axis=1))

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)

# Models
lr = LogisticRegression()
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train, y_train)
rfc.fit(X_train, y_train)
gbc.fit(X_train, y_train)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

print("----------------------Gradient Boosting Scores----------------------")
print('Training set score:', gbc.score(X_train, y_train))
print('\nTest set score:', gbc.score(X_test, y_test))

----------------------Logistic Regression Scores----------------------
Training set score: 0.9441666666666667

Test set score: 0.695
----------------------Random Forest Scores----------------------
Training set score: 0.945

Test set score: 0.6675
----------------------Gradient Boosting Scores----------------------
Training set score: 0.6891666666666667

Test set score: 0.675


The results are worse than in the 1-gram case: every test score dropped. The gap between training and test scores for logistic regression and random forest is also larger than before. That's because the 2-gram case has more features than the 1-gram case, and each 2-gram tends to appear in fewer sentences than a single word does. One possible way to improve the models is to use 1-grams and 2-grams together as features. This will be one of your tasks in the assignment.
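
To get a feel for why the 2-gram feature set is larger, the sketch below counts distinct 1-grams and 2-grams in a tiny hypothetical corpus. The sentences are invented for illustration, so the exact numbers say nothing about the novels; they only show how word pairs multiply faster than single words.

```python
# A tiny, hypothetical corpus (not taken from the novels)
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat saw the dog",
    "the dog saw the cat on the mat",
]

unigram_vocab = set()
bigram_vocab = set()
for sentence in corpus:
    tokens = sentence.split()
    unigram_vocab.update(tokens)
    # Consecutive word pairs form the 2-grams
    bigram_vocab.update(zip(tokens, tokens[1:]))

print(len(unigram_vocab), len(bigram_vocab))  # 8 12
```

Even in this small sample, eight distinct words produce twelve distinct 2-grams, so the model has more columns to fit with the same number of sentences.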

============================================================================================================


## 1) Your task is to increase the performance of the models that you implemented in the bag-of-words example. Here are some suggested avenues of investigation:

* Other modeling techniques and models

* Making more features that take advantage of the spaCy information, such as grammar, phrases, parts of speech, and so forth

* Making sentence-level features, such as the number of words and amount of punctuation

* Including contextual information, such as the length of previous and next sentences, words repeated from one sentence to the next, and so on

* Or anything else that your heart desires

## Compare your models' performances with those of the example.



In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(binary=True)
X = vectorizer.fit_transform(sentences["text"])
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
sentences = pd.concat([bow_df, sentences[["text", "author"]]], axis=1)

In [41]:
sentences.head()

Unnamed: 0,29th,abbreviation,abide,ability,able,abode,abominate,abroad,absence,absent,absolutely,absurd,abuse,accept,acceptable,acceptance,accession,accidentally,accommodation,accompany,accomplishment,accordingly,account,accurately,accuse,accustomary,accustomed,acknowledge,acquaint,acquaintance,acquainted,acquire,acre,act,action,active,actual,actually,actuate,acute,...,word,work,world,worm,worry,worse,worsting,worth,wound,wow,wrap,wreck,wretched,wretchedly,wretchedness,wriggle,wrinkle,wrist,write,writing,wrong,yard,yawn,ye,year,yearly,yelp,yeoman,yer,yes,yesterday,young,youth,youthful,zeal,zealand,zealous,zigzag,text,author
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Alice begin tired sit sister bank have twice p...,Carroll
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.228787,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,consider mind hot day feel sleepy stupid pleas...,Carroll
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,remarkable Alice think way hear Rabbit oh dear,Carroll
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,oh dear,Carroll
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,shall late,Carroll


In [42]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

Y = sentences['author']
X = np.array(sentences.drop(['text','author'], axis=1))

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)

# Models
lr = LogisticRegression()
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train, y_train)
rfc.fit(X_train, y_train)
gbc.fit(X_train, y_train)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

print("----------------------Gradient Boosting Scores----------------------")
print('Training set score:', gbc.score(X_train, y_train))
print('\nTest set score:', gbc.score(X_test, y_test))

----------------------Logistic Regression Scores----------------------
Training set score: 0.9483333333333334

Test set score: 0.8625
----------------------Random Forest Scores----------------------
Training set score: 0.9733333333333334

Test set score: 0.82375
----------------------Gradient Boosting Scores----------------------
Training set score: 0.8691666666666666

Test set score: 0.795


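In [54]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Doc2Vec expects each document as a list of strings,
# so convert the spaCy tokens to their text
docs = [alice_doc, persuasion_doc]
tagged_documents = [TaggedDocument([token.text for token in doc], [i])
                    for i, doc in enumerate(docs)]

In [56]:
tagged_documents



In [57]:
# Passing the corpus to the constructor builds the vocabulary and trains
# the model, so no separate build_vocab() call is needed
model = Doc2Vec(tagged_documents)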

In [None]:
doc2vec = pd.DataFrame([[document]+list(model[document]) 
                        for document in range(len(tagged_documents))]).drop(0, axis=1)

In [None]:
doc2vec.head()

## 2) In the 2-gram example above, you used only 2-grams as your features. This time, use both 1-gram and 2-gram features together as your feature set. Run the same models as in the example and compare the results.

In [36]:
# Use 1 and 2-grams
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,2))
X = vectorizer.fit_transform(sentences["text"])
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
sentences = pd.concat([bow_df, sentences[["text", "author"]]], axis=1)
sentences.head()

Unnamed: 0,29th,29th september,abbreviation,abbreviation living,abide,abide figure,ability,ability difficulty,able,able convince,able devise,able eat,able far,able leave,able persuade,able ring,able set,able watch,abode,abominate,abroad,abroad intention,abroad supposition,abroad talent,abroad work,absence,absence disinterested,absence home,absence young,absent,absent beginning,absolutely,absolutely hopeless,absurd,absurd carry,absurd look,absurd resume,absurd suspicion,absurd use,abuse,...,young friend,young gentle,young girl,young hayters,young know,young lady,young man,young miss,young people,young person,young sister,young squire,young woman,youth,youth bloom,youth early,youth father,youth fine,youth hardly,youth jaw,youth kill,youth learn,youth like,youth mention,youth say,youth spring,youth vigour,youthful,youthful infatuation,zeal,zeal dwell,zeal sport,zealand,zealand australia,zealous,zealous subject,zigzag,zigzag go,text,author
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Alice begin tired sit sister bank have twice p...,Carroll
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,consider mind hot day feel sleepy stupid pleas...,Carroll
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,remarkable Alice think way hear Rabbit oh dear,Carroll
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,oh dear,Carroll
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,shall late,Carroll


In [37]:
Y = sentences['author']
X = np.array(sentences.drop(['text','author'], axis=1))

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)

# Models
lr = LogisticRegression()
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train, y_train)
rfc.fit(X_train, y_train)
gbc.fit(X_train, y_train)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

print("----------------------Gradient Boosting Scores----------------------")
print('Training set score:', gbc.score(X_train, y_train))
print('\nTest set score:', gbc.score(X_test, y_test))

----------------------Logistic Regression Scores----------------------
Training set score: 0.9683333333333334

Test set score: 0.85625
----------------------Random Forest Scores----------------------
Training set score: 0.9741666666666666

Test set score: 0.83
----------------------Gradient Boosting Scores----------------------
Training set score: 0.8575

Test set score: 0.80875
