# Homework 5 – Analysis of distributed data sources

The data set we already used in the exercise contains another target attribute: age.
Complete task group A **or** B. Always assess the performance of the corresponding classifier/regressor.

**Task group A**
1. Build and test a text classifier that predicts a user's age class (0-10, 11-20, 21-30, 31+).

2. Build an ML name classifier that predicts a user's age class (0-10, 11-20, 21-30, 31+).

3. Build a meta classifier that combines the previously built classifiers to predict the age class (0-10, 11-20, 21-30, 31+).

**Task group B**
1. Build and test a text regressor that predicts a user's specific age (regression).

2. Build an ML name regressor that predicts a user's specific age (regression).

3. Build a meta regressor that combines the previously built regressors to predict the specific age (regression).

<i>**Please make sure:**

- each cell (essential step) is commented on with a short sentence
- new variables / fields are output in sufficient length (e.g., df.head(10))
- each of the tasks is answered with a short written statement

This makes the evaluation much easier and, thus, would help us a lot.</i>
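
The age classes in task group A can be derived from a numeric age column by binning. The following is a minimal sketch with `pandas.cut`, assuming integer ages; the toy ages and the string class names are illustrative and not taken from the data set. By default `pandas.cut` uses right-closed intervals, so with bin edges `[0, 10, 20, 30, 100]` an age of exactly 0 falls outside the first interval unless `include_lowest=True` is set, and ages above 100 are left unbinned.

```python
import pandas as pd

# Hypothetical ages chosen to hit every class boundary.
ages = pd.Series([0, 5, 10, 11, 20, 21, 30, 31, 64])
bins = [0, 10, 20, 30, 100]
class_names = ["0-10", "11-20", "21-30", "31+"]  # illustrative names

# Intervals are right-closed: (0, 10], (10, 20], (20, 30], (30, 100].
# include_lowest=True additionally keeps age 0 in the first class.
binned = pd.cut(ages, bins, labels=class_names, include_lowest=True)

# Without include_lowest, age 0 is not assigned to any class (NaN).
zero_without_lowest = pd.cut(pd.Series([0]), bins)

print(pd.DataFrame({"age": ages, "age_class": binned}))
```

Whether age 0 or ages above 100 occur at all depends on the data, so it is worth checking for missing values in the binned column before training.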


In [31]:
# Task 1: text classifier on the concatenated tweets
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB as bayes
from sklearn.feature_extraction.text import CountVectorizer as countvec
from sklearn.metrics import accuracy_score as accuracy
from sklearn.ensemble import RandomForestClassifier 

# read the data
data = pd.read_pickle('data/twitterData.pkl')
# drop rows with empty tweets (needed for the text classifier) and rows without a finite numeric age
data = data[data['tweets_concatenated'] != '']
data = data[np.isfinite(data['age'])]

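# change the target column for the classifier: bin ages into (0, 10], (10, 20], (20, 30], (30, 100]
# the labels are arbitrary integer codes, one per age class
bins = [0, 10, 20, 30, 100]
labels = [0, 23, 92, 10]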
data['age_binned'] = pd.cut(data['age'], bins, labels=labels)

# split the data three ways: 60% to train the base classifiers (trainSub),
# 24% to train the meta classifier (trainMeta) and 16% for testing (test)
trainSub, tempData = train_test_split(data, test_size=0.4)
trainMeta, test = train_test_split(tempData, test_size=0.4)

# only using tweets for text classifier
trainSub_tweets = trainSub['tweets_concatenated']
trainMeta_tweets = trainMeta['tweets_concatenated']
test_tweets = test['tweets_concatenated']


# use column age_binned as the target for all three splits
y_trainSub = trainSub['age_binned']
y_trainMeta = trainMeta['age_binned']
y_test = test['age_binned']

# use CountVectorizer to turn the tweets into token-count vectors (vocabulary is fitted on trainSub only)
countvectorizer_tweets = countvec()
x_trainSub_tweets = countvectorizer_tweets.fit_transform(trainSub_tweets)
x_trainMeta_tweets = countvectorizer_tweets.transform(trainMeta_tweets)
x_test_tweets = countvectorizer_tweets.transform(test_tweets)

# train a MultinomialNB model on trainSub and score it on the test split
bayes_tweets = bayes()
bayes_tweets.fit(x_trainSub_tweets, y_trainSub)
tweet_score = bayes_tweets.score(x_test_tweets, y_test)
tweetScore_text = "Tweet Score is {:0.2%}".format(tweet_score)
print(tweetScore_text)

# save the predictions as input for the meta classifier in task 3
stacked_input1 = pd.Series(bayes_tweets.predict(x_trainMeta_tweets))
stacked_input1_test = pd.Series(bayes_tweets.predict(x_test_tweets))



Tweet Score is 57.14%
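
An accuracy value on its own is hard to interpret when the age classes are unevenly populated. One way to put it in context, sketched here on made-up labels rather than on the Twitter data, is to compare it with a majority-class baseline and to look at the confusion matrix. The arrays below are hypothetical stand-ins for `y_trainSub`, `y_test` and the model's test predictions, and the class codes 0 to 3 are illustrative.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions (class codes 0..3 are illustrative).
y_train_toy = np.array([2, 2, 2, 1, 1, 3, 2, 0])
y_test_toy = np.array([2, 2, 1, 3, 2])
y_pred_toy = np.array([2, 2, 1, 3, 1])

# Baseline: always predict the most frequent training class.
# DummyClassifier ignores the features, so a column of zeros is enough.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(np.zeros((len(y_train_toy), 1)), y_train_toy)
baseline_acc = baseline.score(np.zeros((len(y_test_toy), 1)), y_test_toy)

model_acc = accuracy_score(y_test_toy, y_pred_toy)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test_toy, y_pred_toy, labels=[0, 1, 2, 3])

print("Baseline accuracy: {:0.2%}".format(baseline_acc))
print("Model accuracy: {:0.2%}".format(model_acc))
print(cm)
```

Applied to the real splits, the same comparison would show whether the text classifier learns more than the class distribution.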


In [33]:
# Task 2: name classifier

# get the name column for all three splits (trainSub, trainMeta for task 3, test)
trainSub_names = trainSub['name']
trainMeta_names = trainMeta['name']
test_names = test['name']

# use CountVectorizer again to turn the names into token-count vectors
cvectorizer_names = countvec()
x_trainSub_names = cvectorizer_names.fit_transform(trainSub_names)

x_trainMeta_names = cvectorizer_names.transform(trainMeta_names)
x_test_names = cvectorizer_names.transform(test_names)

# train a second MultinomialNB model on the name features and score it on the test split
bayes_names = bayes()
bayes_names.fit(x_trainSub_names, y_trainSub)


nameScore2 = bayes_names.score(x_test_names, y_test)
nameScore_text2 = "Name Score is {:0.2%}".format(nameScore2)
print(nameScore_text2)

# save the predictions as input for the meta classifier in task 3
stacked_input2 = pd.Series(bayes_names.predict(x_trainMeta_names))
stacked_input2_test = pd.Series(bayes_names.predict(x_test_names))

Name Score is 58.79%
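
With its default settings, `CountVectorizer` splits each name into whole-word tokens, so a name whose tokens never occurred in the training split is mapped to an all-zero vector. A possible alternative, which is not part of the solution above, is to count character n-grams so that unseen names can still share features with known ones. The sketch below uses invented names and class codes; the n-gram range is an arbitrary choice.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical names and age-class codes, for illustration only.
names_toy = ["Anna Smith", "Annika Berg", "Hans Meier", "Hannes Maier"]
classes_toy = [1, 1, 3, 3]

# 'char_wb' builds character n-grams inside word boundaries;
# (2, 3) means bigrams and trigrams (an arbitrary choice here).
char_vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3))
x_names_toy = char_vectorizer.fit_transform(names_toy)

clf_toy = MultinomialNB()
clf_toy.fit(x_names_toy, classes_toy)

# A name that does not appear in training still has non-zero features.
unseen = char_vectorizer.transform(["Hanna Meyer"])
print(x_names_toy.shape)
print(unseen.nnz, "shared n-gram features")
print(clf_toy.predict(unseen))
```

Whether this helps on the actual data would have to be checked against the word-level score.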


In [35]:
# Task 3: meta classifier

# initialize RF classifier
forest = RandomForestClassifier()
# build a pandas df for training and one for testing
# meta training data
meta_data_train = {'input1': stacked_input1, 'input2': stacked_input2}
meta_data_train = pd.DataFrame(meta_data_train)

meta_data_test = {'input1': stacked_input1_test, 'input2': stacked_input2_test}
meta_data_test = pd.DataFrame(meta_data_test)

# train the meta classifier on the base classifiers' predictions for trainMeta
forest.fit(meta_data_train, y_trainMeta)

metaScore = forest.score(meta_data_test, y_test)
metaScore_text = "Meta Score is {:0.2%}".format(metaScore)

print(metaScore_text)

Meta Score is 59.34%
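
The meta classifier above sees only two hard class labels per user. A common variation of stacking passes the base models' class probabilities instead, which gives the meta model more information about how confident each base model is. The following is a minimal sketch on synthetic count data, not on the Twitter data; `view_a` and `view_b` are hypothetical stand-ins for the tweet and name features, and all sizes and seeds are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
n_samples, n_classes = 300, 4
y_toy = rng.integers(0, n_classes, size=n_samples)

# Two synthetic count-feature views: column k has a higher rate for class k.
one_hot = y_toy[:, None] == np.arange(n_classes)[None, :]
view_a = rng.poisson(1.0 + 3.0 * one_hot)
view_b = rng.poisson(1.0 + 2.0 * one_hot)

# Same three-way split as in the notebook: base training, meta training, test.
idx = np.arange(n_samples)
sub_idx, tmp_idx = train_test_split(idx, test_size=0.4, random_state=0)
meta_idx, test_idx = train_test_split(tmp_idx, test_size=0.4, random_state=0)

# Base models are fitted on the first split only.
nb_a = MultinomialNB().fit(view_a[sub_idx], y_toy[sub_idx])
nb_b = MultinomialNB().fit(view_b[sub_idx], y_toy[sub_idx])

def stack(rows):
    """Concatenate both base models' class probabilities for the given rows."""
    return np.hstack([nb_a.predict_proba(view_a[rows]),
                      nb_b.predict_proba(view_b[rows])])

meta_train = stack(meta_idx)
meta_test = stack(test_idx)

# The meta model is trained on data the base models have not seen.
forest_toy = RandomForestClassifier(random_state=0)
forest_toy.fit(meta_train, y_toy[meta_idx])
meta_score_toy = forest_toy.score(meta_test, y_toy[test_idx])
print("Toy meta score: {:0.2%}".format(meta_score_toy))
```

In the notebook, the same idea would amount to replacing `predict` with `predict_proba` when building `stacked_input1` and `stacked_input2`, which yields one column per class and base model.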


