## Part 1: Entropy

In [1]:
import pandas as pd
import numpy as np

In [2]:
from math import log

1. My method is the following:

    1. Compute entropy and info gain for each attribute.
    2. Sort the attributes in ascending order w.r.t. entropy (descending order for info gain).
    3. While the attribute list is not empty (or its length is still above a chosen threshold), select the first attribute to be the root of the tree/subtree.
    4. Remove the attribute from the sorted list.
    5. Go to step 3.
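
The steps above can be sketched in plain Python on the toy data used in this part. This is a minimal sketch, not a full tree builder: `entropy_bits`, `info_gain` and `split_order` are names chosen for this illustration, and the gains are computed once on the whole dataset rather than recomputed for each subtree.

```python
from math import log2

# Same toy data as the DataFrame used in this notebook
data = {
    'HasJob':         [1, 1, 1, 0, 0, 0, 1, 1],
    'HasFamily':      [1, 1, 0, 1, 0, 1, 0, 0],
    'IsAbove30years': [1, 1, 1, 0, 1, 0, 1, 1],
    'Defaulter':      [0, 0, 0, 0, 1, 1, 1, 1],
}

def entropy_bits(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    total = 0.0
    for value in set(labels):
        p = labels.count(value) / n
        total -= p * log2(p)
    return total

def info_gain(feature, label):
    """Label entropy minus the weighted label entropy after splitting on feature."""
    n = len(label)
    remainder = 0.0
    for value in set(feature):
        subset = [l for f, l in zip(feature, label) if f == value]
        remainder += len(subset) / n * entropy_bits(subset)
    return entropy_bits(label) - remainder

# Step 1: information gain of every attribute
attributes = ['HasJob', 'HasFamily', 'IsAbove30years']
gains = {a: info_gain(data[a], data['Defaulter']) for a in attributes}

# Step 2: sort by descending information gain
ranked = sorted(attributes, key=lambda a: gains[a], reverse=True)

# Steps 3-5: repeatedly take the best remaining attribute
split_order = []
while ranked:
    best = ranked.pop(0)
    split_order.append(best)
    print(best, round(gains[best], 3))
```

On this data the ranking puts HasFamily first, which agrees with the classification-error comparison in this part.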

In [3]:
d = {'HasJob': [1,1,1,0,0,0,1,1], 'HasFamily': [1,1,0,1,0,1,0,0], 
     'IsAbove30years' : [1,1,1,0,1,0,1,1], 'Defaulter' : [0,0,0,0,1,1,1,1]}

df = pd.DataFrame(data=d)
print(df)

   Defaulter  HasFamily  HasJob  IsAbove30years
0          0          1       1               1
1          0          1       1               1
2          0          0       1               1
3          0          1       0               0
4          1          0       0               1
5          1          1       0               0
6          1          0       1               1
7          1          0       1               1


In [4]:
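# Feature selection by classification error
def root_error(label):
    """Fraction of rows with label == 1: the error of always predicting 0 at the root."""
    error = 0
    for i in range(len(df)):
        if df.get(label)[i] == 1:
            error += 1
    return error / len(df)

def classification_error(feature, label):
    """Fraction of rows where feature == label.

    This is the error of predicting label = 1 - feature, which for the
    three features here is also the majority-vote error of the split.
    """
    errors = 0
    for i in range(len(df)):
        # a row counts as an error when the feature value equals the label
        if df.get(feature)[i] == df.get(label)[i]:
            errors += 1
    return errors / len(df)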

print("Root Error:", root_error('Defaulter'))
print("HasJob Split:", classification_error('HasJob', 'Defaulter'))
print("HasFamily Split:", classification_error('HasFamily', 'Defaulter'))
print("IsAbove30years Split:", classification_error('IsAbove30years', 'Defaulter'))

Root Error: 0.5
HasJob Split: 0.375
HasFamily Split: 0.25
IsAbove30years Split: 0.5


From this information, HasFamily is the best feature for our first split because it gives the lowest classification error (0.25).
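
As a cross-check, the error of a split can also be computed by letting each branch predict its majority class. The sketch below is an assumption about how one might generalize the comparison, not the method used in the cell above; `split_error` is a name made up for this example.

```python
from collections import Counter

# Same toy data as the DataFrame used in this notebook
data = {
    'HasJob':         [1, 1, 1, 0, 0, 0, 1, 1],
    'HasFamily':      [1, 1, 0, 1, 0, 1, 0, 0],
    'IsAbove30years': [1, 1, 1, 0, 1, 0, 1, 1],
    'Defaulter':      [0, 0, 0, 0, 1, 1, 1, 1],
}

def split_error(feature, label):
    """Misclassification rate when each branch predicts its majority class."""
    n = len(label)
    mistakes = 0
    for value in set(feature):
        branch = [l for f, l in zip(feature, label) if f == value]
        majority_count = Counter(branch).most_common(1)[0][1]
        mistakes += len(branch) - majority_count
    return mistakes / n

errors = {name: split_error(data[name], data['Defaulter'])
          for name in ('HasJob', 'HasFamily', 'IsAbove30years')}
for name, err in errors.items():
    print(name, err)
```

For this dataset the majority-vote errors are 0.375, 0.25 and 0.5, the same values as printed above, so HasFamily is still the best first split.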

In [54]:
#Entropy and Source Coding Theorem
# probs = [0.7, 0.2, 0.1]

a = 0.7
b = 0.2
c = 0.1

entropy = (-a) * np.log(a) + (-b) * np.log(b) + (-c) * np.log(c)


print(entropy)


0.8018185525433372


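The value 0.80 above is computed with the natural logarithm (`np.log`), so its unit is nats; in bits (log base 2) the entropy is about 1.157. According to Shannon's Source Coding Theorem, this entropy is the minimum average number of bits per symbol needed to encode symbols drawn from this distribution without losing information. A code that uses fewer bits per symbol on average must lose information; a code that uses more bits loses nothing, but spends more than necessary.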
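
To check the units, the same distribution can be evaluated with both logarithm bases and compared with the average length of a concrete code. This is a small sketch; the prefix code `0`, `10`, `11` is a hypothetical example chosen for illustration, not part of the original exercise.

```python
from math import log, log2

probs = [0.7, 0.2, 0.1]

entropy_nats = -sum(p * log(p) for p in probs)   # natural log -> nats
entropy_bits = -sum(p * log2(p) for p in probs)  # log base 2 -> bits

# Hypothetical prefix code for the three symbols: '0', '10', '11'
code_lengths = [1, 2, 2]
avg_length = sum(p * n for p, n in zip(probs, code_lengths))

print(round(entropy_nats, 4))  # entropy in nats
print(round(entropy_bits, 4))  # entropy in bits
print(round(avg_length, 4))    # average code length in bits per symbol
```

The average code length (1.3 bits per symbol) is above the entropy in bits (about 1.157), as the theorem requires for a lossless code.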

## Part 2: Natural Language Processing

1.
The bag-of-words model is a way of representing text data for machine learning algorithms. Simply put, it is a way of extracting features from text for use in modeling. The approach consists of a vocabulary of known words and a measure of the presence of those words in each document (for example, their counts). That is where the “bag” part of the name comes from: any information about the order or structure of words in the document is discarded. The limitations of bag-of-words come from the vocabulary. The vocabulary requires careful design to manage its size, which determines the length and sparsity of each document’s representation. Sparse representations are harder to model. Because word order is discarded, the context and meaning of the original document can be lost.
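
A minimal sketch of the idea, using two made-up sentences and only the standard library (`tokenize` and `bag_of_words` are helper names invented for this example):

```python
import re
from collections import Counter

# Two toy documents with the same words in a different order
docs = [
    "The dog chased the cat",
    "The cat chased the dog",
]

def tokenize(text):
    """Lower-case the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

# The vocabulary is the sorted set of all known words
vocabulary = sorted({word for doc in docs for word in tokenize(doc)})

def bag_of_words(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(doc) for doc in docs]
print(vocabulary)
print(vectors)
```

Both sentences map to the same count vector even though they mean different things, which is the loss of word order described above.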

Word2vec uses shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words. It takes a large corpus of text as its input and produces a vector space, typically of several hundred dimensions, in which each unique word in the corpus is assigned a vector. Word2vec can use either of two model architectures to produce a distributed representation of words: continuous bag-of-words (CBOW) or continuous skip-gram. CBOW has the model predict the current word from a window of surrounding context words. Skip-gram uses the current word to predict the surrounding window of context words.
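
The difference between the two architectures is easiest to see in the training examples they are given. The sketch below only builds those (input, target) pairs for one tokenized sentence; it does not train a network, and `context_windows` is a helper name made up for this illustration.

```python
def context_windows(tokens, window):
    """Yield (center_word, context_words) for each position in tokens."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

tokens = ['bennet', 'made', 'no', 'answer']
window = 1

# CBOW: the context words are the input, the center word is the target
cbow_examples = [(context, center)
                 for center, context in context_windows(tokens, window)]

# Skip-gram: the center word is the input, each context word is a target
skipgram_examples = [(center, c)
                     for center, context in context_windows(tokens, window)
                     for c in context]

print(cbow_examples)
print(skipgram_examples)
```

With a window of 1, CBOW gets one example per position, while skip-gram gets one example per (center, neighbor) pair.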


2.

Word vector (one-hot encoding): a vector as long as the vocabulary in which the element for the given word is set to 1 and all other elements are 0.
Word embedding: a representation of a word as a vector of numbers, usually dense and much shorter than the vocabulary, so that text can be used as numeric input.

The word embedding of a word depends on how the dictionary (vocabulary) is prepared, because in real-world applications we might have a corpus that contains millions of documents with millions of unique words.
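
A small sketch of the contrast, assuming a three-word vocabulary. The dense vectors in `toy_embedding` are invented for illustration only; they are not taken from any trained model.

```python
from math import sqrt

vocabulary = ['elizabeth', 'girl', 'sister']

def one_hot(word):
    """Vector with a 1 at the word's vocabulary index and 0 elsewhere."""
    return [1 if w == word else 0 for w in vocabulary]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-dimensional embeddings, invented for illustration only
toy_embedding = {
    'elizabeth': [0.9, 0.1],
    'girl':      [0.8, 0.3],
    'sister':    [0.7, 0.4],
}

one_hot_similarity = cosine(one_hot('elizabeth'), one_hot('girl'))
dense_similarity = cosine(toy_embedding['elizabeth'], toy_embedding['girl'])
print(one_hot('girl'))
print(one_hot_similarity, dense_similarity)
```

Any two different one-hot vectors have similarity 0, so they carry no notion of relatedness, while dense vectors can place related words close together.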

3.

Corpus in NLP: a corpus is a large collection of texts. It is a body of written or spoken material upon which a linguistic analysis is based.

In [59]:
import re

import bs4 as bs
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.tokenize import sent_tokenize  # tokenizes sentences
from nltk.util import ngrams

# Download the corpora used below (no-op if already present)
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/William/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/William/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [60]:
def review_cleaner(review, lemmatize=True, stem=False):
    '''
    Clean and preprocess a review.

    1. Use regex to remove all special characters (only keep letters)
    2. Make strings to lower case and tokenize / word split reviews
    3. Remove English stopwords
    4. Lemmatize (default) or stem each remaining word
    5. Return the cleaned words as a list
    '''
    ps = PorterStemmer()
    wnl = WordNetLemmatizer()

    # 1. Replace every non-letter character with a space
    review = re.sub("[^a-zA-Z]", " ", review)

    # 2. Tokenize into words (all lower case)
    review = review.lower().split()

    # 3. Set stopwords
    eng_stopwords = set(stopwords.words("english"))

    clean_review = []
    for word in review:
        if word not in eng_stopwords:
            # 4. Lemmatize or stem
            if lemmatize:
                word = wnl.lemmatize(word)
            elif stem:
                if word == 'oed':
                    continue
                word = ps.stem(word)
            clean_review.append(word)
    # 5. Return a list of tokens (one list per sentence is what Word2Vec takes)
    return clean_review

In [61]:
train = pd.read_csv('prideNprejudice.csv')
train.head()

Unnamed: 0,sentences
0,"It is a truth universally acknowledged, that a..."
1,"""My dear Mr. Bennet,"" said his lady to him one..."
2,Bennet replied that he had not.
3,"""But it is,"" returned she; ""for Mrs. Long has ..."
4,Bennet made no answer.


In [62]:
n = len(train['sentences'])

sentences = []

for i in range(n):
    sentences.append(review_cleaner(train['sentences'][i]))


In [64]:
# Set values for various parameters
num_features = 300    # Word vector dimensionality                      
min_word_count = 40   # ignore all words with total frequency lower than this                       
num_workers = 4       # Number of threads to run in parallel
context = 10         # Context window size                                                                                    


# Initialize and train the model (this will take some time)
from gensim.models import word2vec

print("Training word2vec model... ")
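# Note: `size` and `iter` are the gensim 3.x parameter names;
# gensim >= 4.0 renames them to `vector_size` and `epochs`.
model = word2vec.Word2Vec(sentences, workers=num_workers,
                          size=num_features, min_count=min_word_count,
                          window=context, iter=28)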


# save the model for later use. You can load it later using Word2Vec.load()
model_name = "300features_40minwords_10context"
model.save(model_name)

Training word2vec model... 


In [65]:
from sklearn import decomposition
import matplotlib.pyplot as plt

vocab_tmp = list(model.wv.vocab)
# look vectors up through model.wv; indexing the model directly is deprecated
X = model.wv[vocab_tmp]
# get two principal components of the feature space
pca = decomposition.PCA(n_components=2).fit_transform(X)

# set figure settings
plt.figure(figsize=(10,10),dpi=100)

# save pca values and vocab in dataframe df
df = pd.concat([pd.DataFrame(pca),pd.Series(vocab_tmp)],axis=1)
df.columns = ['x', 'y', 'word']

plt.xlabel('1st principal component')
plt.ylabel('2nd principal component')


plt.scatter(x=pca[:, 0], y=pca[:, 1],s=3)
for i, word in enumerate(df['word'][0:100]):
    plt.annotate(word, (df['x'].iloc[i], df['y'].iloc[i]))
plt.title("PCA Embedding")
plt.show()


<matplotlib.figure.Figure at 0x116d89160>

In [66]:
# similarity is cosine similarity; call it on model.wv (model.similarity is deprecated)
print("Similarity (elizabeth, girl): ", model.wv.similarity('elizabeth','girl'))
print("Similarity (family, lydia): ", model.wv.similarity('family','lydia'))
print("Similarity (sir, lady): ", model.wv.similarity('sir','lady'))
print("Similarity (william, enough): ", model.wv.similarity('william','enough'))
print("Similarity (may, sister): ", model.wv.similarity('may','sister'))

Similarity (elizabeth, girl):  0.6491067196921319
Similarity (family, lydia):  0.849229730431287
Similarity (sir, lady):  0.7474620225046102
Similarity (william, enough):  0.1977049622767834
Similarity (may, sister):  0.5857893911449343



In [67]:
vocab_tmp = list(model.wv.vocab)
print('Vocab length:',len(vocab_tmp))

Vocab length: 234


In [70]:
len(model.wv['girl'])

300

### 4. word2vec model for Pride and Prejudice

## Part 3: SQL

In [22]:
import sqlite3

In [23]:
connection = sqlite3.connect('dogs.db')
cursor = connection.cursor()

In [24]:
cursor.execute('DROP TABLE IF EXISTS parents')
sql_command = """CREATE TABLE parents (
    parent VARCHAR(20),
    child VARCHAR(20));"""
cursor.execute(sql_command)

<sqlite3.Cursor at 0x10cb209d0>

In [25]:
sql_command = """INSERT INTO parents (parent, child)
    VALUES ("abraham", "barack") UNION
    VALUES ("abraham", "clinton") UNION
    VALUES ("delano", "herbert") UNION
    VALUES ("fillmore", "abraham") UNION
    VALUES ("fillmore", "delano") UNION
    VALUES ("fillmore", "grover") UNION
    VALUES ("eisenhower", "fillmore");"""
cursor.execute(sql_command)

<sqlite3.Cursor at 0x10cb209d0>

In [26]:
connection.commit()

In [27]:
connection.close()

### 1. Simple Selects on Parent Table

In [31]:
connection = sqlite3.connect('dogs.db')
cursor = connection.cursor()

In [33]:
parents = cursor.execute('SELECT * from parents')
for parent in parents.fetchall(): 
    print(parent)

('abraham', 'barack')
('abraham', 'clinton')
('delano', 'herbert')
('eisenhower', 'fillmore')
('fillmore', 'abraham')
('fillmore', 'delano')
('fillmore', 'grover')


### 2. Select child and parent where abraham is the parent

In [34]:
abe = cursor.execute('SELECT child, parent '
                   'FROM parents '
                   'WHERE parent = "abraham"')
for row in abe.fetchall(): 
    print(row)

('barack', 'abraham')
('clinton', 'abraham')
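
The filter value can also be passed as a query parameter instead of being written into the SQL string. This is a minimal sketch that assumes an in-memory database rather than `dogs.db`, so it can be run on its own; the `?` placeholders are the standard `sqlite3` parameter style.

```python
import sqlite3

# In-memory database so the sketch does not touch dogs.db
connection = sqlite3.connect(':memory:')
cursor = connection.cursor()

cursor.execute('CREATE TABLE parents (parent VARCHAR(20), child VARCHAR(20))')
cursor.executemany(
    'INSERT INTO parents (parent, child) VALUES (?, ?)',
    [
        ('abraham', 'barack'),
        ('abraham', 'clinton'),
        ('delano', 'herbert'),
        ('fillmore', 'abraham'),
        ('fillmore', 'delano'),
        ('fillmore', 'grover'),
        ('eisenhower', 'fillmore'),
    ],
)
connection.commit()

# The parent name is bound to the ? placeholder by sqlite3
rows = cursor.execute(
    'SELECT child, parent FROM parents WHERE parent = ? ORDER BY child',
    ('abraham',),
).fetchall()
for row in rows:
    print(row)

connection.close()
```

Binding the value this way avoids quoting problems when the name comes from a variable.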


### 2. Joins

In [35]:
#setup
cursor.execute('DROP TABLE IF EXISTS dogs')
sql_command = """CREATE TABLE dogs AS
 SELECT "abraham" AS name, "long" AS fur UNION
 SELECT "barack", "short" UNION
 SELECT "clinton", "long" UNION
 SELECT "delano", "long" UNION
 SELECT "eisenhower", "short" UNION
 SELECT "fillmore", "curly" UNION
 SELECT "grover", "short" UNION
 SELECT "herbert", "curly";"""
cursor.execute(sql_command)
connection.commit()
connection.close()

In [36]:
connection = sqlite3.connect('dogs.db')
cursor = connection.cursor()
fur = cursor.execute('SELECT COUNT (*) '
                   'FROM dogs '
                   'WHERE fur = "short"')
for row in fur.fetchall(): 
    print(row)

(3,)


In [37]:
curly = cursor.execute('SELECT parent '
                   'FROM parents '
                   'JOIN dogs ON parents.child = dogs.name '
                   'WHERE dogs.fur = "curly"')
for row in curly.fetchall(): 
    print(row)

('eisenhower',)
('delano',)


In [38]:
furtype = cursor.execute('SELECT parent, child '
                   'FROM parents '
                   'JOIN dogs a ON a.name = parents.child '
                   'JOIN dogs b ON b.name = parents.parent '
                   'WHERE a.fur = b.fur'
                   )
for row in furtype.fetchall(): 
    print(row)

('abraham', 'clinton')


In [39]:
connection.commit()
connection.close()

### 3. Aggregate functions, numerical logic and grouping

In [40]:
connection = sqlite3.connect('dogs.db')
cursor = connection.cursor()

In [41]:
#setup
cursor.execute('DROP TABLE IF EXISTS animals')
sql_command = """create table animals as
 select "dog" as kind, 4 as legs, 20 as weight union
 select "cat" , 4 , 10 union
 select "ferret" , 4 , 10 union
 select "parrot" , 2 , 6 union
 select "penguin" , 2 , 10 union
select "t-rex" , 2 , 12000;"""
cursor.execute(sql_command)

<sqlite3.Cursor at 0x10cc33c00>

In [42]:
connection.commit()
connection.close()

In [43]:
connection = sqlite3.connect('dogs.db')
cursor = connection.cursor()

In [45]:
weight = cursor.execute('SELECT kind, MIN(weight) '
                   'FROM animals ')
for row in weight.fetchall(): 
    print(row)

('parrot', 6)


In [46]:
avgweight = cursor.execute('SELECT AVG(legs), AVG(weight) '
                   'FROM animals ')
for row in avgweight.fetchall(): 
    print(row)

(3.0, 2009.3333333333333)


In [47]:
legs = cursor.execute('SELECT kind, weight, legs '
                   'FROM animals '
                   'WHERE legs > 2 '
                   'AND weight < 20')
for row in legs.fetchall(): 
    print(row)

('cat', 10, 4)
('ferret', 10, 4)


In [48]:
weight = cursor.execute('SELECT AVG(weight) '
                   'FROM animals '
                   'GROUP BY legs')
for row in weight.fetchall(): 
    print(row)

(4005.3333333333335,)
(13.333333333333334,)
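
The two averages above are easier to read when the grouping column is selected as well. The sketch below assumes an in-memory copy of the `animals` table and adds an `ORDER BY` so the row order is explicit.

```python
import sqlite3

# In-memory copy of the animals table so the sketch runs on its own
connection = sqlite3.connect(':memory:')
cursor = connection.cursor()

cursor.execute('CREATE TABLE animals (kind TEXT, legs INTEGER, weight INTEGER)')
cursor.executemany(
    'INSERT INTO animals (kind, legs, weight) VALUES (?, ?, ?)',
    [
        ('dog', 4, 20),
        ('cat', 4, 10),
        ('ferret', 4, 10),
        ('parrot', 2, 6),
        ('penguin', 2, 10),
        ('t-rex', 2, 12000),
    ],
)

# Select legs next to the aggregate so each average is labeled by its group
rows = cursor.execute(
    'SELECT legs, AVG(weight) FROM animals GROUP BY legs ORDER BY legs'
).fetchall()
for row in rows:
    print(row)

connection.close()
```

The first row is the two-legged group (dominated by the t-rex) and the second is the four-legged group.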
