<h1>NER using CRF</h1>

<h2>Instructions</h2>

- The required modules and the path to the data file (stored in path) are already loaded for you.

- Start by exploring the data and understanding the different tags.

- You will notice a lot of NaN values; clean them up (preferably using ffill).

- The class Sentence has been defined to group the dataframe into sentences. Pass your dataframe to it and store the resulting object in sObject.

- Take the sent attribute of sObject (i.e. sObject.sent) and store it in a variable called sentences. Print it to see what it contains.

- The functions word2features (converts a word into features), sent2features (builds the features for a sentence with the help of word2features) and sent2labels (extracts the labels from a sentence) are already defined for you. Make sure you understand what these functions do and how they do it.

- Get the features by passing each element of sentences (you will need a loop) to sent2features() and store the result in variable X.

- Get the labels by passing each element of sentences (you will need a loop) to sent2labels() and store the labels in variable y.

- Split X, y into train and test sets for model fitting. Store them in X_train, X_test, y_train, y_test accordingly.

- Initialise a sklearn_crfsuite.CRF model called crf and fit it on X_train, y_train.

- Use crf to predict on X_test and store the predicted values in y_pred.

- Use metrics.flat_classification_report to view the full classification report for y_test and y_pred.

In [2]:
#import required modules
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
import sklearn_crfsuite
from sklearn_crfsuite import metrics
import eli5
import joblib


In [79]:
# To view output of all cells
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [3]:
path = '../input/ReutersNERDataset.csv'

In [4]:
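# Read the file with the given encoding and skip malformed lines instead of raising an error.
# Note: error_bad_lines was removed in pandas 2.0; on_bad_lines="skip" (pandas >= 1.3) replaces it.
df = pd.read_csv(path, encoding="ISO-8859-1", on_bad_lines="skip")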

In [5]:
df.head()

,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O


In [7]:
df.isna().sum()

Sentence #    1000616
Word                0
POS                 0
Tag                 0
dtype: int64

In [10]:
# Forward-fill the missing sentence labels (fillna(method=...) is deprecated in recent pandas)
df['Sentence #'] = df['Sentence #'].ffill()

In [13]:
df.head()

,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,Sentence: 1,of,IN,O
2,Sentence: 1,demonstrators,NNS,O
3,Sentence: 1,have,VBP,O
4,Sentence: 1,marched,VBN,O


In [16]:
df.nunique()

Sentence #    47959
Word          35178
POS              42
Tag              17
dtype: int64

In [25]:
df['sentenceNum'] = df['Sentence #'].str.split(':').apply(lambda x : x[1])

In [30]:
df['sentenceNum'] = df['sentenceNum'].astype('int')

In [31]:
df.dtypes

Sentence #     object
Word           object
POS            object
Tag            object
sentenceNum     int64
dtype: object
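
The forward fill and the sentence-number extraction above can be tried on a small frame first. The following is a minimal sketch on made-up rows that mimic the layout of the dataset; the variable name toy and the use of str.extract (instead of splitting on the colon) are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the dataset layout: only the first token of each
# sentence carries the sentence label, the rest are NaN.
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", np.nan, np.nan, "Sentence: 2", np.nan],
    "Word": ["Thousands", "of", "demonstrators", "Iranian", "officials"],
})

# Forward fill copies the last seen label down into the gaps.
toy["Sentence #"] = toy["Sentence #"].ffill()

# Pull out the numeric id; a regex avoids depending on the exact spacing
# after the colon.
toy["sentenceNum"] = (
    toy["Sentence #"].str.extract(r"(\d+)", expand=False).astype(int)
)
print(toy)
```

Because sentenceNum is an integer column, sorting or grouping on it follows numeric order rather than string order.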

In [32]:
df.groupby('sentenceNum').agg({'Word' : 'count', 'POS' : 'count', 'Tag' : 'count'})

sentenceNum,Word,POS,Tag
1,24,24,24
2,30,30,30
3,14,14,14
4,15,15,15
5,25,25,25
...,...,...,...
47955,20,20,20
47956,24,24,24
47957,11,11,11
47958,11,11,11


In [75]:
class Sentence(object):
    """Convert token-level rows into sentences.

    Attributes:
        data: the input dataframe
        grouped: pandas Series indexed by 'Sentence #'; each value is the list
            of (word, pos, tag) tuples for that sentence
        sent: list of sentences, each a list of (word, pos, tag) tuples

    Args:
        data (pandas.DataFrame): one row per word, with its POS tag and NER tag
    """

    data = None
    sent = None
    grouped = None

    def __init__(self, data):
        self.data = data
        # Turn the rows of one sentence into a list of (word, pos, tag) tuples.
        list_vals = lambda row: list(zip(row['Word'], row['POS'], row['Tag']))
        # Group by the 'Sentence #' column so that all the words of a sentence
        # are grouped together.
        self.grouped = self.data.groupby('Sentence #').apply(list_vals)

        # Collect the per-sentence lists into 'sent'.
        self.sent = [row for row in self.grouped]

In [34]:
sObject = Sentence(df)

In [40]:
sentences = sObject.sent

In [73]:
sObject.grouped

Sentence #
Sentence: 1        [(Thousands, NNS, O), (of, IN, O), (demonstrat...
Sentence: 10       [(Iranian, JJ, B-gpe), (officials, NNS, O), (s...
Sentence: 100      [(Helicopter, NN, O), (gunships, NNS, O), (Sat...
Sentence: 1000     [(They, PRP, O), (left, VBD, O), (after, IN, O...
Sentence: 10000    [(U.N., NNP, B-geo), (relief, NN, O), (coordin...
                                         ...                        
Sentence: 9995     [(Opposition, NNP, O), (leader, NN, O), (Mir, ...
Sentence: 9996     [(On, IN, O), (Thursday, NNP, B-tim), (,, ,, O...
Sentence: 9997     [(Following, VBG, O), (Iran, NNP, B-geo), ('s,...
Sentence: 9998     [(Since, IN, O), (then, RB, O), (,, ,, O), (au...
Sentence: 9999     [(The, DT, O), (United, NNP, B-org), (Nations,...
Length: 47959, dtype: object
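
The output above is ordered Sentence: 1, Sentence: 10, Sentence: 100, and so on, because the group keys are strings and are sorted lexicographically. The sketch below, on hypothetical toy rows, contrasts grouping on the string label with grouping on an integer sentence number; group_sentences is an illustrative helper, not a function from the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sentence #": ["Sentence: 2", "Sentence: 2", "Sentence: 10", "Sentence: 10"],
    "Word": ["Iran", "agrees", "London", "marches"],
    "POS": ["NNP", "VBZ", "NNP", "VBZ"],
    "Tag": ["B-geo", "O", "B-geo", "O"],
})
toy["sentenceNum"] = (
    toy["Sentence #"].str.extract(r"(\d+)", expand=False).astype(int)
)


def group_sentences(frame, key):
    """Return one list of (word, pos, tag) tuples per group, in sorted key order."""
    return [
        list(zip(rows["Word"], rows["POS"], rows["Tag"]))
        for _, rows in frame.groupby(key)
    ]


# String keys: "Sentence: 10" sorts before "Sentence: 2".
by_label = group_sentences(toy, "Sentence #")
# Integer keys: 2 sorts before 10, matching the order in the corpus.
by_number = group_sentences(toy, "sentenceNum")

print(by_label[0])
print(by_number[0])
```

The order of sentences does not change what the CRF learns from each one, but it matters if sentences[i] is expected to correspond to sentence number i + 1.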

In [38]:
print(sentences[ : 2])

[[('Thousands', 'NNS', 'O'),
  ('of', 'IN', 'O'),
  ('demonstrators', 'NNS', 'O'),
  ('have', 'VBP', 'O'),
  ('marched', 'VBN', 'O'),
  ('through', 'IN', 'O'),
  ('London', 'NNP', 'B-geo'),
  ('to', 'TO', 'O'),
  ('protest', 'VB', 'O'),
  ('the', 'DT', 'O'),
  ('war', 'NN', 'O'),
  ('in', 'IN', 'O'),
  ('Iraq', 'NNP', 'B-geo'),
  ('and', 'CC', 'O'),
  ('demand', 'VB', 'O'),
  ('the', 'DT', 'O'),
  ('withdrawal', 'NN', 'O'),
  ('of', 'IN', 'O'),
  ('British', 'JJ', 'B-gpe'),
  ('troops', 'NNS', 'O'),
  ('from', 'IN', 'O'),
  ('that', 'DT', 'O'),
  ('country', 'NN', 'O'),
  ('.', '.', 'O')],
 [('Iranian', 'JJ', 'B-gpe'),
  ('officials', 'NNS', 'O'),
  ('say', 'VBP', 'O'),
  ('they', 'PRP', 'O'),
  ('expect', 'VBP', 'O'),
  ('to', 'TO', 'O'),
  ('get', 'VB', 'O'),
  ('access', 'NN', 'O'),
  ('to', 'TO', 'O'),
  ('sealed', 'JJ', 'O'),
  ('sensitive', 'JJ', 'O'),
  ('parts', 'NNS', 'O'),
  ('of', 'IN', 'O'),
  ('the', 'DT', 'O'),
  ('plant', 'NN', 'O'),
  ('Wednesday', 'NNP', 'B-tim'),
  ('

In [76]:
def word2features(sent, i):
    """Build the feature dict for one token, in the format expected by sklearn-crfsuite.

    The features cover:
        - The lower-cased word and its last two and three characters.
        - Whether the word is upper case, title case, or consists of digits only.
        - The POS tag of the word and the first two characters of that tag.
        - The lower-cased form, title/upper-case flags and POS tags of the
          previous and next word, or BOS / EOS markers at the sentence boundaries.

    Args:
        sent (list of tuple): sentence as a list of (word, pos, tag) tuples
        i (int): index of the current token in the sentence

    Returns:
        dict: feature dictionary compatible with the sklearn_crfsuite API
    """
    
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features

In [77]:
def sent2features(sent):
    """Get the feature dicts for every token of a sentence.

    Args:
        sent (list of tuple): sentence as a list of (word, pos, tag) tuples

    Returns:
        list: one feature dict per token, as expected by the CRF
    """
    return [word2features(sent, i) for i in range(len(sent))]

In [78]:

def sent2labels(sent):
    """Get the labels of a sentence for training the CRF model.

    Args:
        sent (list of tuple): sentence as a list of (word, pos, tag) tuples

    Returns:
        list: NER tag of each token in the sentence
    """
    return [label for token, postag, label in sent]

In [44]:

def sent2tokens(sent):
    """Get the words of a sentence given as (word, pos, tag) tuples."""
    return [token for token, postag, label in sent]

In [48]:
# Creating features for CRF model
X = [sent2features(sentence) for sentence in sentences]

In [58]:
# Create labels for CRF model
y = [sent2labels(sentence) for sentence in sentences]
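
After these two cells, X is a list of sentences where each sentence is a list of per-token feature dicts, and y is a list of tag lists of the same lengths. The sketch below shows that shape on one made-up sentence; toy_word2features is a reduced, hypothetical stand-in for word2features and covers only a few of its features.

```python
def toy_word2features(sent, i):
    """Reduced, hypothetical variant of word2features: current word plus neighbours."""
    word, postag, _ = sent[i]
    features = {
        "bias": 1.0,
        "word.lower()": word.lower(),
        "word.istitle()": word.istitle(),
        "postag": postag,
    }
    if i > 0:
        features["-1:word.lower()"] = sent[i - 1][0].lower()
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        features["+1:word.lower()"] = sent[i + 1][0].lower()
    else:
        features["EOS"] = True  # end of sentence
    return features


toy_sentence = [("London", "NNP", "B-geo"), ("protests", "VBZ", "O"), (".", ".", "O")]

# One outer list entry per sentence, one dict / one tag per token.
X_toy = [[toy_word2features(toy_sentence, i) for i in range(len(toy_sentence))]]
y_toy = [[tag for _, _, tag in toy_sentence]]

print(X_toy[0][0])
print(y_toy[0])
```

The lengths of X_toy[k] and y_toy[k] must match for every sentence k, since the CRF assigns exactly one label per token.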

In [61]:
# creating train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
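
The split above draws a different random partition on each run, so the scores reported further down will vary slightly between runs. A minimal sketch, on hypothetical toy data, of fixing random_state so the partition is repeatable (the seed value 42 is an arbitrary assumption):

```python
from sklearn.model_selection import train_test_split

# Ten toy "sentences", each with a single token, in the same nested
# list-of-lists layout as X and y.
X_toy = [[{"word.lower()": f"w{i}"}] for i in range(10)]
y_toy = [["O"] for _ in range(10)]

# The same seed yields the same partition on every call.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(len(X_tr), len(X_te))
print(X_te == X_te2)
```

Note that whole sentences are assigned to either side of the split, so the tokens of one sentence never end up partly in train and partly in test.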

In [63]:
# fit a CRF model and predict labels
crf = sklearn_crfsuite.CRF(verbose=True)
crf.fit(X_train, y_train)
y_pred = crf.predict(X_test)

CRF(keep_tempfiles=None)

In [66]:
# print classification metrics
print(metrics.flat_classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       B-art       0.71      0.06      0.12        77
       B-eve       0.63      0.38      0.48        50
       B-geo       0.85      0.91      0.88      7428
       B-gpe       0.96      0.93      0.95      3137
       B-nat       0.57      0.29      0.38        42
       B-org       0.82      0.72      0.77      3967
       B-per       0.85      0.83      0.84      3411
       B-tim       0.93      0.87      0.90      4029
       I-art       0.50      0.03      0.06        58
       I-eve       0.60      0.29      0.39        41
       I-geo       0.83      0.81      0.82      1486
       I-gpe       0.92      0.52      0.67        46
       I-nat       0.00      0.00      0.00        11
       I-org       0.83      0.80      0.81      3319
       I-per       0.85      0.91      0.88      3483
       I-tim       0.86      0.75      0.80      1287
           O       0.99      0.99      0.99    176651

    accuracy                           0

In [67]:
metrics.flat_f1_score(y_test, y_pred, average='weighted')

0.9713529942001111
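
The "flat" metrics treat all sentences as one long token sequence. Since the O tag has a support of 176651 in the report above, far more than any entity tag, a weighted average is dominated by it. The following pure-Python sketch, on made-up labels, shows per-label scores on flattened sequences and a macro F1 over the entity tags only; flat_prf is an illustrative helper, not the sklearn_crfsuite implementation.

```python
from itertools import chain


def flat_prf(y_true, y_pred, label):
    """Precision, recall and F1 for one label over flattened sequences."""
    true_flat = list(chain.from_iterable(y_true))
    pred_flat = list(chain.from_iterable(y_pred))
    pairs = list(zip(true_flat, pred_flat))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true_toy = [["B-geo", "O", "O"], ["B-per", "I-per", "O"]]
y_pred_toy = [["B-geo", "O", "O"], ["O", "I-per", "O"]]

# Average F1 over the entity tags only, leaving out the majority tag O.
entity_labels = sorted({t for seq in y_true_toy for t in seq} - {"O"})
macro_f1_no_o = sum(
    flat_prf(y_true_toy, y_pred_toy, lab)[2] for lab in entity_labels
) / len(entity_labels)

print(flat_prf(y_true_toy, y_pred_toy, "O"))
print(macro_f1_no_o)
```

With the real model, a similar view can be obtained by passing a labels list without O to metrics.flat_classification_report or metrics.flat_f1_score.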

In [69]:
eli5.show_weights(crf, top=10)


From \ To,O,B-art,I-art,B-eve,I-eve,B-geo,I-geo,B-gpe,I-gpe,B-nat,I-nat,B-org,I-org,B-per,I-per,B-tim,I-tim
O,5.24,2.366,0.0,2.722,0.0,3.887,0.0,3.396,0.0,2.342,0.0,4.207,0.0,5.701,0.0,4.407,0.0
B-art,-1.262,0.0,5.699,0.0,0.0,-0.215,0.0,0.0,0.0,0.0,0.0,0.199,0.0,-0.381,0.0,-0.038,0.0
I-art,-1.44,0.0,5.306,0.0,0.0,-0.32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.148,0.0,-0.498,0.0
B-eve,-1.314,0.0,0.0,0.0,5.355,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.215,0.0
I-eve,-0.727,0.0,0.0,0.0,5.078,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.537,0.0
B-geo,0.342,0.406,0.0,0.061,0.0,0.0,8.289,1.693,0.0,0.0,0.0,0.817,0.0,1.307,0.0,2.22,0.0
I-geo,-0.891,0.464,0.0,0.0,0.0,0.0,6.495,-0.059,0.0,0.0,0.0,0.508,0.0,0.893,0.0,0.567,0.0
B-gpe,1.353,-0.502,0.0,-0.187,0.0,1.495,0.0,0.0,5.348,0.0,0.0,2.65,0.0,2.317,0.0,1.02,0.0
I-gpe,-0.828,0.0,0.0,0.0,0.0,-0.444,0.0,0.0,2.785,0.0,0.0,0.0,0.0,0.373,0.0,0.0,0.0
B-nat,-1.382,0.0,0.0,0.0,0.0,0.079,0.0,0.0,0.0,0.0,4.123,0.0,0.0,-0.132,0.0,-0.316,0.0
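
In the transition table above, the O row shows 0.0 for every I- column, while continuations such as B-geo to I-geo (8.289) carry large positive weights. This matches the BIO convention that an I-x tag continues an entity opened by B-x. A minimal sketch of checking predicted sequences against that convention follows; bio_violations is a hypothetical helper and the tag sequences are made up.

```python
def bio_violations(tags):
    """Return indices where an I-x tag does not follow B-x or I-x of the same type."""
    bad = []
    prev = "O"  # treat the position before the sentence as outside any entity
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            entity = tag[2:]
            if prev not in (f"B-{entity}", f"I-{entity}"):
                bad.append(i)
        prev = tag
    return bad


ok = ["O", "B-geo", "I-geo", "O"]
broken = ["O", "I-geo", "B-per", "I-org"]

print(bio_violations(ok))
print(bio_violations(broken))
```

Running such a check over y_pred gives a quick indication of how often the model emits sequences that break the tagging scheme.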

y=O top features
Weight?,Feature
+4.795,bias
+4.526,word.lower():month
+4.200,word.lower():last
+3.911,word.lower():year
+3.741,BOS
+3.247,word.lower():hurricane
+3.168,word.lower():jewish
… 65258 more positive …
… 8817 more negative …
-3.358,word.istitle()

y=B-art top features
Weight?,Feature
+2.328,word.lower():twitter
+1.926,word.lower():english
+1.392,word[-3:]:ish
+1.370,-1:word.lower():film
+1.345,word.lower():gdp
+1.342,word[-3:]:GDP
+1.335,word.lower():spaceshipone
+1.322,word.lower():canal
+1.313,word.lower():facebook
+1.309,word[-3:]:One

y=I-art top features
Weight?,Feature
+1.046,-1:word.istitle()
+0.868,-1:word.lower():boeing
+0.845,-1:word.isupper()
+0.800,word.lower():flowers
+0.762,word[-3:]:und
+0.737,word[-2:]:nd
+0.723,word.lower():station
+0.678,word[-2:]:um
+0.676,word.isdigit()
+0.664,word[-3:]:ium

y=B-eve top features
Weight?,Feature
+2.016,word[-3:]:II
+2.016,word.lower():ii
+1.969,word[-2:]:II
+1.776,-1:word.lower():war
+1.762,word.lower():ramadan
+1.415,word.lower():olympic
+1.408,word[-3:]:pic
+1.301,+1:word.lower():open
+1.267,+1:word.lower():war
+1.158,word.lower():i

y=I-eve top features
Weight?,Feature
+1.149,word[-3:]:Day
+1.148,word.lower():day
+1.121,word.lower():open
+1.121,word[-3:]:pen
+1.112,-1:word.istitle()
+1.112,word.lower():games
+1.066,-1:word.lower():hurricane
+0.989,-1:word.lower():war
+0.907,-1:word.lower():typhoon
+0.845,word[-3:]:mes

y=B-geo top features
Weight?,Feature
+3.407,word.lower():beijing
+3.004,word.lower():iran
+2.913,-1:word.lower():mr.
+2.791,word.lower():washington
+2.786,word.lower():israel
+2.597,word.lower():china
+2.585,word.lower():caribbean
+2.524,word.lower():france
+2.403,word.lower():britain
+2.329,word.lower():martian

y=I-geo top features
Weight?,Feature
+1.831,word.lower():city
+1.748,word.lower():republic
+1.742,-1:word.lower():united
+1.621,word.lower():east
+1.553,word.lower():airport
+1.477,word.lower():island
+1.399,-1:word.lower():of
+1.364,word[-3:]:ast
+1.334,word.lower():ocean
+1.306,-1:word.lower():gulf

y=B-gpe top features
Weight?,Feature
+3.945,word.lower():niger
+3.813,word.lower():nepal
+3.810,word.istitle()
+3.161,word[-3:]:pal
+3.021,word.lower():afghan
+2.769,word.lower():jordan
+2.665,word.lower():korean
+2.618,postag:NNS
+2.514,word[-3:]:ger
… 2663 more positive …

y=I-gpe top features
Weight?,Feature
+2.126,+1:word.lower():mayor
+1.736,-1:word.lower():bosnian
+1.483,word.istitle()
+1.422,-1:postag:NNP
+1.301,postag:NNS
+1.265,word[-3:]:can
+1.075,word.lower():cypriots
+1.064,-1:word.lower():north
+1.025,-1:word.lower():united
… 232 more positive …

y=B-nat top features
Weight?,Feature
+3.457,word.lower():katrina
+2.139,word.lower():marburg
+2.128,word.lower():rita
+1.949,word[-3:]:ita
+1.856,word[-2:]:N1
+1.663,word[-3:]:urg
+1.622,word[-2:]:rg
+1.590,word.isupper()
+1.531,word[-3:]:5N1
+1.531,word.lower():h5n1

y=I-nat top features
Weight?,Feature
+1.403,word.lower():rita
+1.396,word[-3:]:ita
+1.301,word[-2:]:ta
+1.077,word.lower():katrina
+1.026,-1:word.lower():hurricanes
+0.992,word.lower():flu
+0.982,word[-2:]:lu
+0.891,-1:word.lower():hurricane
+0.886,-1:word.istitle()
+0.861,word[-3:]:ina

y=B-org top features
Weight?,Feature
+4.145,word.lower():philippine
+3.849,word.lower():hamas
+3.678,word.lower():al-qaida
+2.804,word.lower():congress
+2.734,word.lower():hezbollah
+2.722,word.lower():taleban
+2.717,-1:word.lower():senator
+2.631,-1:word.lower():niger
+2.556,word.lower():reuters
+2.505,word.lower():european

y=I-org top features
Weight?,Feature
+2.084,word[-3:]:for
+1.849,word.lower():department
+1.766,-1:word.lower():european
+1.715,+1:word.lower():mr.
+1.660,-1:word.lower():u.s.
+1.572,+1:word.lower():mil
+1.499,+1:word.lower():post
… 7537 more positive …
… 1742 more negative …
-1.499,word.lower():city

y=B-per top features
Weight?,Feature
+4.798,word.lower():president
+3.737,BOS
+3.562,word.lower():prime
+3.031,word.lower():senator
+3.016,word.lower():obama
+2.807,word.lower():vice
+2.632,word[-2:]:r.
+2.587,word.lower():western
+2.371,word.lower():clinton
+2.085,word.lower():hall

y=I-per top features
Weight?,Feature
+1.529,word.lower():rice
+1.443,-1:word.lower():michael
+1.258,+1:word.lower():recep
+1.257,-1:word.lower():paul
+1.257,+1:word.lower():reports
+1.252,-1:word.lower():condoleezza
… 8052 more positive …
… 1628 more negative …
-1.309,word[-3:]:ion
-1.448,word[-3:]:day

y=B-tim top features
Weight?,Feature
+4.496,word[-3:]:day
+3.326,-1:word.lower():week
+3.198,word[-3:]:Day
+3.160,word.lower():february
+3.114,word.lower():january
+3.020,word[-2:]:0s
+2.974,+1:word.lower():years
+2.959,word.lower():august
+2.815,word.lower():march
+2.778,+1:word.lower():weeks

y=I-tim top features
Weight?,Feature
+3.134,word[-3:]:day
+2.087,word.lower():evening
+2.065,word.lower():morning
+2.015,word[-2:]:m.
+2.015,word[-3:]:.m.
+1.995,word[-2:]:ay
+1.927,word[-3:]:ber
+1.897,word.lower():afternoon
+1.879,-1:word.lower():past
+1.669,word.lower():january


In [71]:
# saving and loading a model
joblib.dump(crf, '../models/crf_model.joblib')
# joblib.load('../models/crf_model.joblib')

['../models/crf_model.joblib']
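
The save and load steps above can be rehearsed without the trained model. The sketch below round-trips a stand-in object through joblib in a temporary directory; model_stub and the directory layout are assumptions for illustration. It also creates the target directory first, since joblib.dump does not create missing directories.

```python
import os
import tempfile

import joblib

# Stand-in for the fitted model; any picklable object round-trips the same way.
model_stub = {"labels": ["O", "B-geo", "I-geo"]}

with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "models")
    os.makedirs(model_dir, exist_ok=True)  # make sure the target directory exists
    model_path = os.path.join(model_dir, "crf_model.joblib")

    joblib.dump(model_stub, model_path)    # returns the list of written file names
    restored = joblib.load(model_path)

print(restored == model_stub)
```

For the real model, the loaded object can be used directly, for example by calling its predict method on new feature lists.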