
**Introduction**

We analyze data about Twitter users.

*What was accomplished?*


1.     Worked with several classifiers and prepared a helper for quick analysis:
         
             train_model(classifier, X_train, y_train, X_test, y_test)

Where:

     class classifier:

          def fit
              ...

          def predict
              ...

          def predict_proba
               ...


Examples:

* MLP Multi-layer Perceptron classifier
* KNeighborsClassifier
* Random Forests


2.  Performed text analysis

 * cleaning
 * stemming
 * stop words
 * vectorization (unigrams, n-grams)
 * MultinomialNB()
 * TF-IDF (term frequency - inverse document frequency)

3.  Modular neural networks

*   voting strategy
*   myEnseblePredict(estimators, data, voting="Fuzzy voting")

4.  Model evaluation

*   Null accuracy









**Loading libraries**

In [None]:
import pandas as pd
import numpy as np
import re
import matplotlib
import matplotlib.pyplot as plt
matplotlib.get_backend()
import seaborn as sns
import nltk
from nltk.corpus import stopwords
from sklearn import decomposition, ensemble
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import  precision_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer  # for bag of words
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import normalize
import operator
from functools import reduce



**Loading data**

The full dataset, with a summary.

In [None]:
metadata = pd.read_csv('../input/gender-classifier-DFE-791531.csv', encoding='latin1')

metadata.info()  # info() prints the summary itself and returns None


**Selecting features for analysis**

We select the following columns:

| Column | Description |
|--|--|
| _golden | whether the user was included in the gold standard for the model |
| _trusted_judgments | number of trusted judgments |
| gender | one of male, female, or brand (for non-human profiles) |
| gender:confidence | a float representing confidence in the provided gender |
| profile_yn:confidence | confidence in the existence/non-existence of the profile |
| retweet_count | number of times the user has retweeted (or possibly, been retweeted) |
| text | text of a random one of the user's tweets |
| tweet_count | number of tweets that the user has posted |
| description | the user's profile description |
| fav_number | number of tweets the user has favorited |
| link_color | link color on the profile, as a hex value |
| sidebar_color | color of the profile sidebar, as a hex value |


In [None]:
data_row = pd.read_csv('../input/gender-classifier-DFE-791531.csv',usecols= [1,3,5,6,8,10,11,13,17,18,19,21],encoding='latin1')
display(data_row.head(10))

**Data cleaning**

We keep only the rows with a known gender and a gender:confidence above 90%. Rows that fail the condition are filled with NaN by `where` and dropped later.


In [None]:

data = data_row.where((data_row['gender:confidence'] > 0.9) & (data_row['gender'] != 'unknown'))




We convert the labels to numbers:

'male'    $ \Rightarrow$ 0, 

'female'    $ \Rightarrow$ 1, 

'brand'    $ \Rightarrow$ 2

In [None]:

print("Mapping gender: {'male':0, 'female':1, 'brand':2}")
print(data.gender.head(5))
data['gender_label'] = data.gender.map({'male':0, 'female':1, 'brand':2})
print(data.gender_label.head(5))

**Colors**



We will predict gender from two colors:

    the sidebar color

    the link color

We start by converting each color from the hex form #RRGGBB (stored without the leading #) to an (r, g, b) tuple.

In [None]:
#------------------------Colors-------------------------

def hexToRGB(color):
    if color == '0':
        return 255,255,255
    if len(color)<5:
        return None, None, None
    try:
        r=int(color[0:2],16)
        g=int(color[2:4],16)
        b=int(color[4:6],16)
    except (RuntimeError, TypeError, NameError, ValueError):
        return None, None, None
    else:
        return r,g,b
    
    

print("Converting colors from hex RRGGBB to decimal (r, g, b)")
print(data["link_color"].head(5))
data["rl"] = data["link_color"].apply(lambda x: hexToRGB(str(x))[0])
data["gl"] = data["link_color"].apply(lambda x: hexToRGB(str(x))[1])
data["bl"] = data["link_color"].apply(lambda x: hexToRGB(str(x))[2])
data["rs"] = data["sidebar_color"].apply(lambda x: hexToRGB(str(x))[0])
data["gs"] = data["sidebar_color"].apply(lambda x: hexToRGB(str(x))[1])
data["bs"] = data["sidebar_color"].apply(lambda x: hexToRGB(str(x))[2])
print(data[['rl', 'gl','bl']].head(5))
data.dropna(inplace=True,axis=0)



male_top_sidebar_color = data[data['gender'] == 'male']['sidebar_color'].value_counts().head(10)
male_top_sidebar_color_idx = male_top_sidebar_color.index
male_top_color = male_top_sidebar_color_idx.values

male_top_color[1] = '000000'
print (male_top_color)
l = lambda x: '#'+x

sns.set_style("darkgrid", {"axes.facecolor": "#F5ABB5"})
plot3 = sns.barplot (x = male_top_sidebar_color, y = male_top_color, palette=list(map(l, male_top_color)))
fig3 = plot3.get_figure()
fig3.savefig("male_colours.jpg")


In [None]:
female_top_sidebar_color = data[data['gender'] == 'female']['sidebar_color'].value_counts().head(10)
female_top_sidebar_color_idx = female_top_sidebar_color.index
female_top_color = female_top_sidebar_color_idx.values

female_top_color[2] = '000000'
print (female_top_color)
l = lambda x: '#'+x

sns.set_style("darkgrid", {"axes.facecolor": "#F5ABB5"})
plot4 =sns.barplot (x = female_top_sidebar_color, y = female_top_color, palette=list(map(l, female_top_color)))
sns.set_style("darkgrid", {"axes.facecolor": "#FFFFFF"})
fig4 = plot4.get_figure()
fig4.savefig("female_colours.jpg")

I define a train_model function that handles the training for me.

In [None]:

def train_model(classifier, feature_vector_train, label, feature_vector_valid, valid_y, is_neural_net=False):
    """Train a model and evaluate it on the validation set.
    Keyword arguments:
    classifier -- classifier providing fit, predict and predict_proba
    feature_vector_train -- X_train
    label -- y_train
    feature_vector_valid -- X_val
    valid_y -- y_val
    Returns:
    accuracy -- accuracy on the validation set,
    probabilities -- predicted probability of each class,
    precision -- per-class precision on the validation set,
    classifier -- the fitted classifier."""
    # fit the classifier on the training dataset
    if is_neural_net:
        # only for Keras-style networks, whose fit takes epochs
        classifier.fit(feature_vector_train, label, epochs=2)
    else:
        classifier.fit(feature_vector_train, label)
    # predict the labels on the validation dataset
    predictions = classifier.predict(feature_vector_valid)
    if is_neural_net:
        # such networks return class scores, so pick the best class
        predictions = predictions.argmax(axis=-1)

    return (metrics.accuracy_score(valid_y, predictions),
            classifier.predict_proba(feature_vector_valid),
            precision_score(valid_y, predictions, average=None),
            classifier)






Splitting the data into test and training sets:

* 15% of the data for testing
* 85% for training

In [None]:
test_percentage = 0.15
X_train, X_test, y_train, y_test = train_test_split(data[['rl','gl','bl','rs','gs','bs']], data['gender_label'], test_size=test_percentage, random_state=20)

Multi-layer Perceptron classifier (MLP).

One hidden layer.

We try different numbers of perceptrons according to the formula:

    (no of inputs + no of outputs)^0.5 + (1 to 20)
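As a quick check of the formula, the candidate layer sizes can be listed directly. This is a minimal sketch; it assumes the six color components as inputs and the three gender labels as outputs, and the names `n_inputs`, `n_outputs`, `base` and `candidates` are introduced here only for illustration.

```python
import math

n_inputs = 6    # rl, gl, bl, rs, gs, bs
n_outputs = 3   # male, female, brand

# (no of inputs + no of outputs)^0.5, rounded to a whole number of perceptrons
base = round(math.sqrt(n_inputs + n_outputs))

# base + (1 to 20)
candidates = [base + k for k in range(1, 21)]
print(candidates[0], "to", candidates[-1])
```

With these assumptions the search covers hidden layers of 4 to 23 perceptrons.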

In [None]:

hidden_min = 4
hidden_layer_size = 20
print("Searching for the best number of perceptrons between", hidden_min, "and", hidden_min + hidden_layer_size - 1, "using the formula:")
print("(no of inputs + no of outputs)^0.5 + (1 to", hidden_layer_size, ")")
mlp_accuracy = [0]*(hidden_min+hidden_layer_size)
for i in range(hidden_min,hidden_min+hidden_layer_size):
    accuracy, Color_predictions, color_precision, color_classifier = train_model(MLPClassifier(hidden_layer_sizes=(i), early_stopping=True, random_state = np.random.RandomState(47)),
                                                               X_train, y_train, X_test, y_test)
    mlp_accuracy[i] = accuracy
    print("N iter= ", color_classifier.n_iter_ )

fig5, ax = plt.subplots()
ax.plot(range(hidden_min,hidden_min+hidden_layer_size), mlp_accuracy[hidden_min:hidden_min+hidden_layer_size], 'o')
plt.ylabel('accuracy')
plt.xlabel('n_perceptrons')
plt.title('MLPClassifier accuracy')
plt.show()  

fig5.savefig("mlpclassifier_accuracy.jpg")

accuracy, Color_predictions,color_precision, color_classifier = train_model(MLPClassifier(hidden_layer_sizes=(np.argmax(mlp_accuracy)), early_stopping=True, random_state = np.random.RandomState(47)), X_train, y_train, X_test,y_test)
print("MLPClassifier(hidden_layer_sizes={0}) Color accuracy: {1}  ".format(np.argmax(mlp_accuracy),accuracy))

In [None]:
hidden_layer_size = 20
print("Searching for the best number of perceptrons per layer, with two hidden layers, between", hidden_layer_size, "and", 2*hidden_layer_size - 1)
mlp_accuracy = [0]*(hidden_layer_size+hidden_layer_size)
for i in range(hidden_layer_size,hidden_layer_size+hidden_layer_size):
    accuracy, Color_predictions, color_precision, color_classifier = train_model(MLPClassifier(hidden_layer_sizes=(i,i), early_stopping=True, learning_rate='invscaling', random_state = np.random.RandomState(47)),
                                                               X_train, y_train, X_test, y_test)
    mlp_accuracy[i] = accuracy
    print("N iter= ", color_classifier.n_iter_ )

fig5, ax = plt.subplots()
ax.plot(range(hidden_layer_size,hidden_layer_size+hidden_layer_size), mlp_accuracy[hidden_layer_size:hidden_layer_size+hidden_layer_size], 'o')
plt.ylabel('accuracy')
plt.xlabel('n_perceptrons')
plt.title('MLPClassifier accuracy')
plt.show()  

fig5.savefig("mlpclassifier_accuracy_2_layers.jpg")

accuracy, Color_predictions,color_precision, color_classifier = train_model(MLPClassifier(hidden_layer_sizes=(np.argmax(mlp_accuracy),np.argmax(mlp_accuracy)), early_stopping=True,learning_rate='invscaling', random_state = np.random.RandomState(47)), X_train, y_train, X_test,y_test)
print("MLPClassifier(hidden_layer_sizes={0}) Color accuracy: {1}  ".format(np.argmax(mlp_accuracy),accuracy))

I repeat the exercise for the nearest-neighbours method,

searching for the best number of neighbours between 1 and 39.

In [None]:
max_neighbours = 40
print("Searching for the best number of neighbours between 1 and", max_neighbours - 1)

neighbours_accuracy = [0]*max_neighbours
for i in range(1,max_neighbours):
    accuracy, Color_predictions, color_precision, color_classifier = train_model(KNeighborsClassifier(i), X_train, y_train, X_test,y_test)
    neighbours_accuracy[i] = accuracy
    
fig6, ax = plt.subplots()
ax.plot(range(1,max_neighbours), neighbours_accuracy[1:max_neighbours], 'o')
plt.ylabel('accuracy')
plt.xlabel('n_neighbours')
plt.title('KNeighborsClassifier accuracy')
plt.show() 
accuracy, Color_predictions, color_precision, color_classifier = train_model(KNeighborsClassifier(np.argmax(neighbours_accuracy)), X_train, y_train, X_test,y_test)
print("KNeighborsClassifier({0}) Color accuracy:  {1} ".format(np.argmax(neighbours_accuracy), accuracy))
fig6.savefig("KNeighborsClassifier_accuracy.jpg")


**Analysis of the remaining features**

In [None]:
#----------------------------- Other Features --------------------------------

# other_features = data [["_golden" , "_trusted_judgments" , "fav_number", "retweet_count" , "tweet_count" , "profile_yn:confidence"]]
X_train, X_test, y_train, y_test = train_test_split(normalize(data[["_golden" , "_trusted_judgments" , "fav_number", "retweet_count" , "tweet_count" , "profile_yn:confidence"]], axis=0), data['gender_label'], test_size=test_percentage, random_state=20)


hidden_layer_size = 10
print("Searching for the best number of perceptrons between", hidden_layer_size, "and", hidden_layer_size + 8)
mlp_accuracy = [0]*(hidden_layer_size+9)
for i in range(hidden_layer_size,hidden_layer_size+9):
    accuracy, Other_Features, other_precision, other_classifier = train_model(
        MLPClassifier(hidden_layer_sizes=(i), max_iter=200, early_stopping = True, random_state = np.random.RandomState(47)), X_train, y_train, X_test, y_test)
    mlp_accuracy[i] = accuracy

accuracy, Other_Features, other_precision, other_classifier = train_model(MLPClassifier(hidden_layer_sizes=(np.argmax(mlp_accuracy)),max_iter=200, early_stopping = True, random_state = np.random.RandomState(47)), X_train, y_train, X_test,y_test)
print("MLPClassifier(hidden_layer_sizes={0}) Other_Features accuracy: {1}  ".format(np.argmax(mlp_accuracy),accuracy))



**Text analysis**

In [None]:
def cleaning(s):
    """Lowercase a text and strip punctuation, digits and encoding debris."""
    s = str(s)
    s = s.lower()
    # expand common contractions
    s = s.replace("'s", ' is')
    s = s.replace("'re", ' are')
    s = s.replace("'ve", ' have')
    s = re.sub(r'[^\w]', ' ', s)  # punctuation and other non-word characters
    s = re.sub(r'[\d_]+', '', s)  # digits and underscores
    # mojibake left over from reading the file as latin1
    for ch in "ùûüåâä":
        s = s.replace(ch, "")
    # URL fragments; whole tokens only, so words such as "cool" stay intact
    s = re.sub(r'\b(https|co)\b', ' ', s)
    s = re.sub(r'\s+', ' ', s)  # collapse runs of whitespace
    return s



print("Sample entry: ")
print(data.text.get(64))
print("\n")
data['Tweets'] = [cleaning(s) for s in data['text']]
data['Description'] = [cleaning(s) for s in data['description']]

print("Sample entry after removing special characters: ")
print(data.Tweets.get(64))

print("\n")

data.dropna(inplace=True,axis=0) # drop rows with missing data (NaN)

**Stemming**


Stemming reduces inflected words to a common stem:


car, cars, car's, cars' $\Rightarrow$ car 


Mapping am, are, is $\Rightarrow$ be would require lemmatization, which the Snowball stemmer used below does not perform.

In [None]:
def stop_words(data, column_name,splitted=False):
    # remove stop words from the given column
    
    stop = set(stopwords.words('english'))
    notInStopSet = lambda word: word not in stop
    if splitted == False:
        data[column_name] = data[column_name].str.lower().str.split()

    data[column_name] = data[column_name].apply(lambda x: [item for item in x if notInStopSet(item)])

    if splitted == False:
        data[column_name] = data[column_name].apply(lambda x: ' '.join(x))
        
def stemming(data, column_name,splitted=False):
    # stem the words of the given column (stop words are dropped as well)
    sno = nltk.stem.SnowballStemmer('english')
    stop = set(stopwords.words('english'))
    notInStopSet = lambda word: word not in stop
    if splitted == False:
        data[column_name] = data[column_name].str.lower().str.split()

    data[column_name] = data[column_name].apply(lambda x: [sno.stem(item) for item in x if notInStopSet(item)])

    if splitted == False:
        data[column_name] = data[column_name].apply(lambda x: ' '.join(x))

        
stop_words(data, 'Tweets')
stop_words(data, 'Description')

print("Sample entry after removing stop words: ")
print(data.Tweets.get(64))
print("\n")        
        
stemming(data, 'Tweets')
stemming(data, 'Description')

print("Sample entry after stemming: ")
print(data.Tweets.get(64))
print("\n")

A look at the data:

In [None]:
Male = data[data['gender_label'] == 0]
Female = data[data['gender_label'] == 1]
Brand = data[data['gender_label'] == 2]
Male_Words = pd.Series(' '.join(Male['Tweets'].astype(str)).lower().split(" ")).value_counts()[:20]
Female_Words = pd.Series(' '.join(Female['Tweets'].astype(str)).lower().split(" ")).value_counts()[:20]
Brand_words = pd.Series(' '.join(Brand['Tweets'].astype(str)).lower().split(" ")).value_counts()[:20]
plot0 = Female_Words.plot(kind='bar',stacked=True, colormap='OrRd', title='Most used words by female users')
fig0 = plot0.get_figure()
fig0.savefig("female_words.jpg")

In [None]:
plot1 = Male_Words.plot(kind='bar',stacked=True, colormap='plasma', title='Most used words by male users' )
fig1 = plot1.get_figure()
fig1.savefig("male_words.jpg")


In [None]:
plot2 = Brand_words.plot(kind='bar',stacked=True, colormap='Paired', title='Most used words by brands')
fig2 = plot2.get_figure()
fig2.savefig("brand_words.jpg")



We will mainly use the *naive Bayes method.*
In short:

We count the occurrences of each distinct word.

Example:

    “A great game”            

    “A clean but forgettable game” 

After unigram vectorization:


                                       A   great  game  clean  but  forgettable  
    “A great game”                    [1    1     1      0     0       0     ]

    “A clean but forgettable game”    [1    0     1      1     1       1     ]
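The count vectors above can be reproduced with a few lines of plain Python. This is only an illustrative sketch: the helpers `build_vocabulary` and `count_vector` are hypothetical names introduced here, and they split on whitespace only. The notebook itself uses `CountVectorizer`, whose settings in this notebook (tokens of at least two characters, `min_df=2`) would drop the word "A" and any word seen in a single text.

```python
def build_vocabulary(documents):
    """Collect the distinct lowercase words in order of first appearance."""
    vocabulary = []
    for doc in documents:
        for word in doc.lower().split():
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary


def count_vector(document, vocabulary):
    """Count how many times each vocabulary word occurs in the document."""
    words = document.lower().split()
    return [words.count(term) for term in vocabulary]


documents = ["A great game", "A clean but forgettable game"]
vocabulary = build_vocabulary(documents)
vectors = [count_vector(doc, vocabulary) for doc in documents]

print(vocabulary)
for doc, vector in zip(documents, vectors):
    print(doc, vector)
```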




In [None]:
#------------------------TEXT ANALYSIS------------------------------------


y_all = data.gender_label
X_Tweets = data.Tweets
X_Description = data.Description


X_Tweets_train, X_Tweets_test, y_train, y_test = train_test_split(X_Tweets, y_all, test_size=test_percentage, random_state=20)
X_Description_train, X_Description_test, y_train, y_test = train_test_split(X_Description, y_all, test_size=test_percentage, random_state=20)
# print(y_test)Usredniona 64%

# Unigrams


def vektoriser(X_train, X_test,ngram = False):
    if ngram == False:
        vect= CountVectorizer(analyzer='word', token_pattern=r'[0-9a-zA-Z]{2,}',min_df=2)
    else:
        vect = CountVectorizer(analyzer='word', token_pattern=r'[0-9a-zA-Z]{2,}', ngram_range=(1, 2),min_df=2)
    vect.fit(X_train)
    return vect.transform(X_train), vect.transform(X_test), vect

# print("Typ danych = " , type(X_Tweets_test))
X_Tweets_train_dtm, X_Tweets_test_dtm, tweets_vect = vektoriser(X_Tweets_train, X_Tweets_test)
print(len(tweets_vect.vocabulary_))

X_Description_train_dtm, X_Description_test_dtm, description_vect = vektoriser(X_Description_train,X_Description_test)








*Naive Bayes method*

| Text | Tag |
|--|--|
| “A great game” | Male |
| “The election was over” | Female |
| “Very clean match” | Male |
| “A clean but forgettable game” | Male |
| “It was a close election” | Female |

We want to compute the probability that the words "A very close game" were written by a man:
$$P(Male |\; A\;very\; close\; game) = \frac{P( A\;very\; close\; game |\; Male) \cdot P(Male)}{P(A\;very\; close\; game )}$$

However, we only need to compare the probability that a man wrote it with the probability that a woman did, or that the entry did not come from a human. The denominator is the same in every case, so we drop it.


We compute only: $$P( A\;very\; close\; game |\; Male) \cdot P(Male)$$


This is where the name "naive" comes from: the method assumes that the words are conditionally independent given the class, so that
$$P( A\;very\; close\; game |\; Male) = P(A |\; Male) \cdot P(very |\; Male) \cdot P(close |\; Male) \cdot P(game |\; Male)$$
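The comparison can be worked through on the five example texts. The sketch below is an illustration rather than the notebook's method: the names `training`, `score` and `alpha` are introduced here, and it assumes add-one (Laplace) smoothing, which the text does not mention but which keeps an unseen word such as "close" for Male from zeroing the whole product. `MultinomialNB` applies the same kind of smoothing through its `alpha` parameter, which defaults to 1.0.

```python
from collections import Counter

training = [
    ("A great game", "Male"),
    ("The election was over", "Female"),
    ("Very clean match", "Male"),
    ("A clean but forgettable game", "Male"),
    ("It was a close election", "Female"),
]

word_counts = {}        # class -> Counter of word occurrences
doc_counts = Counter()  # class -> number of training texts
for text, tag in training:
    doc_counts[tag] += 1
    word_counts.setdefault(tag, Counter()).update(text.lower().split())

vocabulary = {word for counts in word_counts.values() for word in counts}


def score(text, tag, alpha=1.0):
    """P(text | tag) * P(tag) under the naive independence assumption."""
    prior = doc_counts[tag] / sum(doc_counts.values())
    total = sum(word_counts[tag].values())
    likelihood = 1.0
    for word in text.lower().split():
        # smoothed estimate of P(word | tag)
        likelihood *= (word_counts[tag][word] + alpha) / (total + alpha * len(vocabulary))
    return prior * likelihood


scores = {tag: score("A very close game", tag) for tag in doc_counts}
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

Under these assumptions the Male score is larger, so the sentence is attributed to a male author.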

    
    
    
    
    

In [None]:


accuracy, Tweets_predictions, tweet_precision, tweet_classifier = train_model(MultinomialNB(), X_Tweets_train_dtm, y_train, X_Tweets_test_dtm, y_test)
print("NB, Tweets WordLevel:  ", accuracy)



accuracy, Description_predictions, description_precision,description_classifier = train_model(MultinomialNB(), X_Description_train_dtm, y_train, X_Description_test_dtm,y_test)
print("NB, Description WordLevel:  ", accuracy)

# classify every vocabulary word on its own and keep the words
# that the model assigns to one class with probability above 0.7
words = list(description_vect.vocabulary_.keys())
probs = description_classifier.predict_proba(description_vect.transform(words))

all_gender_words = {word: np.argmax(p) for p, word in zip(probs, words) if max(p) > 0.7}
masculine_words = [word for word, label in all_gender_words.items() if label == 0]
print("Masculine words", masculine_words)
feminine_words = [word for word, label in all_gender_words.items() if label == 1]
print("Feminine words", feminine_words)
brand_words = [word for word, label in all_gender_words.items() if label == 2]
print("Brand words", brand_words)

import sys
from termcolor import colored, cprint





In [None]:
def print_colorful_tweet(Tweet):
     
    if type(Tweet) == str:
#         print('size == 1')
        input_frase = Tweet.split()
    else:
#         print('size != 1')
        input_frase = Tweet
        
    colors = {0:'\033[34m', 1:'\033[95m', 2:'\033[32m'} # blue, pink, green
    frase = str()
    if not input_frase:
        return ""
    for word in input_frase:
        if word in all_gender_words:
            gender = all_gender_words.get(word)
#             print(gender)
            color = colors.get(gender)
#             print(color)
            frase += colors.get(gender) + word + '\x1b[0m'+ " " 
        else:
            frase += word + " "
            pass
    print(frase)

print_colorful_tweet(data.Description.get(64))
print_colorful_tweet(data.Description.get(75))
print_colorful_tweet(data.Description.get(94))
print_colorful_tweet(data.Description.get(67))
print_colorful_tweet(data.Description.get(86))
print_colorful_tweet(data.Description.get(84))

The model can be improved somewhat by using n-grams:

* unigram (1-gram)

|A| clean |but| forgettable| game  |
|--|--|--|--|--|

* bigram (2-gram)

|A clean | clean but| but  forgettable| forgettable game  |
|--|--|--|--|

* trigram (3-gram)

|A clean but | clean but  forgettable| but forgettable game  |
|--|--|--|
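The n-grams listed above can be generated with a sliding window over the words. This is a minimal sketch; the helper name `ngrams` is introduced here for illustration, while the notebook obtains n-grams through the `ngram_range` parameter of `CountVectorizer`.

```python
def ngrams(text, n):
    """Return all runs of n consecutive words from the text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "A clean but forgettable game"
unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```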




In [None]:
# Bigrams----------------------------------------------------------------------------------------------------------------------

X_Tweets_train_dtm, X_Tweets_test_dtm, tweets_vect = vektoriser(X_Tweets_train, X_Tweets_test,ngram=True)
X_Description_train_dtm, X_Description_test_dtm, description_vect = vektoriser(X_Description_train,X_Description_test,ngram=True)


accuracy, Tweets_predictions, tweet_precision, tweet_classifier = train_model(MultinomialNB(), X_Tweets_train_dtm, y_train, X_Tweets_test_dtm,y_test)
print("NB, Tweets N-Gram Vectors:", accuracy)
# print(tweet_classifier.predict_proba((tweets_vect.transform(X_Tweets))[1]))



accuracy, Description_predictions,description_precision,description_classifier  = train_model(MultinomialNB(), X_Description_train_dtm, y_train, X_Description_test_dtm,y_test)
print("NB, Description N-Gram Vectors:", accuracy)

*TF-IDF* (*term frequency - inverse document frequency*)

| Text | Tag |
|--|--|
| “A great game” | Male |
| “The election was over” | Female |
| “Very **clean** match” | Male |
| “A **clean** but forgettable game” | Male |
| “It was a close election” | Female |

Computing the term frequency:
$$TF(clean, \;A\; clean\; but\; forgettable\; game) = \frac{1}{5} = 20\%$$
Computing the inverse document frequency:
$$ IDF(clean) = \ln{\frac{\text{number of all entries}}{\text{number of entries containing the word "clean"}}} =\ln\left(\frac{5}{2}\right) $$
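Both quantities can be checked numerically on the five example texts. The sketch below follows the plain textbook definitions given above, with helper names (`term_frequency`, `inverse_document_frequency`) introduced here for illustration. Note that `TfidfVectorizer` with its default settings uses a smoothed IDF and normalizes each row, so its values will differ from this hand calculation.

```python
import math

documents = [
    "A great game",
    "The election was over",
    "Very clean match",
    "A clean but forgettable game",
    "It was a close election",
]


def term_frequency(term, document):
    """Share of the document's words that are equal to the term."""
    words = document.lower().split()
    return words.count(term) / len(words)


def inverse_document_frequency(term, documents):
    """ln(number of documents / number of documents containing the term)."""
    containing = sum(1 for doc in documents if term in doc.lower().split())
    return math.log(len(documents) / containing)


tf = term_frequency("clean", "A clean but forgettable game")
idf = inverse_document_frequency("clean", documents)
tf_idf = tf * idf
print(tf, idf, tf_idf)
```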

In [None]:
X_Tweets_train, X_Tweets_test, y_train, y_test = train_test_split(X_Tweets, y_all, test_size=test_percentage, random_state=20)

encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(y_train)
valid_y = encoder.transform(y_test)  # reuse the mapping fitted on the training labels

xtrain_count =  X_Tweets_train_dtm
xvalid_count =  X_Tweets_test_dtm


tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)



tfidf_vect.fit(X_Tweets_train)

xtrain_tfidf =  tfidf_vect.transform(X_Tweets_train)
xvalid_tfidf =  tfidf_vect.transform(X_Tweets_test)


# print(tfidf_vect.vocabulary_)

# ngram level tf-idf
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=5000)
tfidf_vect_ngram.fit(X_Tweets_train)
xtrain_tfidf_ngram =  tfidf_vect_ngram.transform(X_Tweets_train)
xvalid_tfidf_ngram =  tfidf_vect_ngram.transform(X_Tweets_test)

# characters level tf-idf
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer='char', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=5000)
tfidf_vect_ngram_chars.fit(X_Tweets_train)
xtrain_tfidf_ngram_chars =  tfidf_vect_ngram_chars.transform(X_Tweets_train)
xvalid_tfidf_ngram_chars =  tfidf_vect_ngram_chars.transform(X_Tweets_test)



# Naive Bayes on Word Level TF IDF Vectors
accuracy = train_model(MultinomialNB(), xtrain_tfidf, train_y, xvalid_tfidf, valid_y)[0]
print("NB, Tweets WordLevel TF-IDF: ", accuracy)

# Naive Bayes on Ngram Level TF IDF Vectors
accuracy = train_model(MultinomialNB(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram, valid_y)[0]
print("NB, Tweets N-Gram Vectors: ", accuracy)

# Naive Bayes on Character Level TF IDF Vectors
accuracy = train_model(MultinomialNB(), xtrain_tfidf_ngram_chars, train_y, xvalid_tfidf_ngram_chars, valid_y)[0]
print("NB, Tweets CharLevel Vectors: ", accuracy)

# Linear Classifier on Count Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_count, train_y, xvalid_count, valid_y)[0]
print("LR, Tweets Count Vectors: ", accuracy)

# Linear Classifier on Word Level TF IDF Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf, train_y, xvalid_tfidf, valid_y)[0]
print("LR, Tweets WordLevel TF-IDF: ", accuracy)

# Linear Classifier on Ngram Level TF IDF Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram, valid_y)[0]
print("LR, Tweets N-Gram Vectors: ", accuracy)

# Linear Classifier on Character Level TF IDF Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf_ngram_chars, train_y, xvalid_tfidf_ngram_chars, valid_y)[0]
print("LR, Tweets CharLevel Vectors: ", accuracy)



**Random Forests**

![obraz.png](https://blog.citizennet.com/hs-fs/hubfs/Imported_Blog_Media/RF.jpg?t=1536614411555&width=2356&name=RF.jpg)

In [None]:
# RF on Count Vectors
accuracy = train_model(ensemble.RandomForestClassifier(), xtrain_count, train_y, xvalid_count, valid_y)[0]
print("RF, Count Vectors: ", accuracy)

# RF on Word Level TF IDF Vectors
accuracy = train_model(ensemble.RandomForestClassifier(), xtrain_tfidf, train_y, xvalid_tfidf, valid_y)[0]
print("RF, WordLevel TF-IDF: ", accuracy)

**Modular neural networks**

*“Voting”* is the most popular strategy for multi-module cooperation. 
The output of each module, pointing to a particular class, is treated as one vote. The decision is made according to the adopted voting strategy.


Source: https://pdfs.semanticscholar.org/721c/f173c00fff7af5edca9bd8976407591f647c.pdf

*1.  Weighted mean:*


Example:

Network A, with accuracy 50%:   [0.2, 0.6, 0.1]

Network B, with accuracy 90%:   [0.1, 0.5, 0.4]

We compute: 
        50% * [0.2, 0.6, 0.1] + 90%  *  [0.1, 0.5, 0.4] =  [0.1, 0.3, 0.05] +  [0.09, 0.45, 0.36]
 
        Result = [0.19, 0.75, 0.41]
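The weighted sum from the example can be sketched as follows. The helper `weighted_vote` is a hypothetical name introduced here, and it weights each network by a single accuracy value as in the example, whereas the notebook's own cell weights every class by the per-class precision.

```python
def weighted_vote(outputs, weights):
    """Sum the class scores of all networks, each scaled by its weight."""
    n_classes = len(outputs[0])
    return [sum(w * out[c] for out, w in zip(outputs, weights))
            for c in range(n_classes)]


outputs = [
    [0.2, 0.6, 0.1],  # network A
    [0.1, 0.5, 0.4],  # network B
]
weights = [0.5, 0.9]  # accuracies of A and B

totals = weighted_vote(outputs, weights)
winner = totals.index(max(totals))
print(totals, winner)
```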

In [None]:
def getPrediction(Probabs):
    return [ np.argmax(x) for x in Probabs]



total_Predictions = (tweet_precision*Tweets_predictions + description_precision*Description_predictions + color_precision*Color_predictions + other_precision*Other_Features)
y_pred_class = getPrediction(total_Predictions)


print("Weighted mean bigrams: " , metrics.accuracy_score(y_test, y_pred_class))

We try leaving out the least effective network, Other Features:

In [None]:
total_Predictions = (tweet_precision*Tweets_predictions + description_precision*Description_predictions + color_precision*Color_predictions)
y_pred_class = getPrediction(total_Predictions)
print("Weighted mean bigrams: " , metrics.accuracy_score(y_test, y_pred_class))

*2.  Plurality voting.*


It is the most common voting scheme. Each voter votes for one alternative, and the alternative with the largest number of votes wins [9]. The advantage of this scheme, from the NN perspective, is that it only uses the highest output value, which is the most probable output to be true. However, it does not consider the outputs’ preferences.



Example:

Network A                    [0.2, **0.6**, 0.1]

Network B                    [0.1, **0.5**, 0.4]

Network C                    [**0.6**, 0.3, 0.1]


We compute: 
         [0, 1, 0] +  [0, 1, 0] + [1, 0, 0] = [1, **2**, 0] 
 
        Result = [0 , 1 , 0]
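A minimal plain-Python sketch of plurality voting on the three example outputs; the helper name `plurality` and the variables are introduced here for illustration only.

```python
def plurality(output):
    """Turn a score vector into a one-hot ballot for its highest entry."""
    best = output.index(max(output))
    return [1 if i == best else 0 for i in range(len(output))]


outputs = [
    [0.2, 0.6, 0.1],  # network A
    [0.1, 0.5, 0.4],  # network B
    [0.6, 0.3, 0.1],  # network C
]

ballots = [plurality(o) for o in outputs]
totals = [sum(column) for column in zip(*ballots)]
winner = totals.index(max(totals))
print(ballots, totals, winner)
```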
        
*3.  Borda count voting.*


For m alternatives, assign m − 1 points for the alternative ranked first, m − 2 points for the second, and so on. The alternative ranked last receives zero points. The points given to the different alternatives (modules) are summed up and the highest is the winner.

Example:

Network A                    [0.2, **0.6**, 0.1]

Network B                    [0.1, **0.5**, 0.4]

Network C                    [**0.6**, 0.3, 0.1]


We compute: 
         [1, 2, 0] +  [0, 2, 1] + [2, 1, 0] = [3, **5**, 1] 
 
        Result = [0 , 1 , 0]
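The Borda points can be derived by sorting the class indices by score. The sketch below is illustrative: `borda` is a name introduced here, and ties are broken by position, which the description does not specify.

```python
def borda(output):
    """Give m - 1 points to the highest score, down to 0 for the lowest."""
    # class indices ordered from the lowest to the highest score
    order = sorted(range(len(output)), key=lambda i: output[i])
    points = [0] * len(output)
    for rank, idx in enumerate(order):
        points[idx] = rank
    return points


outputs = [
    [0.2, 0.6, 0.1],  # network A
    [0.1, 0.5, 0.4],  # network B
    [0.6, 0.3, 0.1],  # network C
]

ballots = [borda(o) for o in outputs]
totals = [sum(column) for column in zip(*ballots)]
winner = totals.index(max(totals))
print(ballots, totals, winner)
```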
        
*4.  Fuzzy voting.*
 
Each voter assigns a number between zero and one for each candidate. Compare the summation of the votes’ values for all candidates. The higher is the winner. And since the modules’ outputs are normally between the zero and one values, the value will represent the true voting bid.

Example:

Network A                    [0.2, **0.6**, 0.1]

Network B                    [0.1, **0.5**, 0.4]

Network C                    [**0.6**, 0.3, 0.1]


We compute: 
          [0.2, **0.6**, 0.1] + [0.1, **0.5**, 0.4]+  [**0.6**, 0.3, 0.1]= [0.9, **1.4**, 0.6] 
 
        Result = [0 , 1 , 0]
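Fuzzy voting amounts to summing the raw outputs class by class, which a short sketch makes explicit (the variable names are introduced here for illustration):

```python
outputs = [
    [0.2, 0.6, 0.1],  # network A
    [0.1, 0.5, 0.4],  # network B
    [0.6, 0.3, 0.1],  # network C
]

# add up the votes for each class across all networks
totals = [sum(column) for column in zip(*outputs)]
winner = totals.index(max(totals))
print(totals, winner)
```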
        
*5.  Nash voting.*

It is similar to the Fuzzy voting but compares the product of the votes’ values for all candidates. The higher is the winner.


Example:

Network A                    [0.2, **0.6**, 0.1]

Network B                    [0.1, **0.5**, 0.4]

Network C                    [**0.6**, 0.3, 0.1]


We compute: 
          [0.2, **0.6**, 0.1] .* [0.1, **0.5**, 0.4].*  [**0.6**, 0.3, 0.1]= [0.012, **0.09**, 0.004] 
 
        Result = [0 , 1 , 0]
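Nash voting replaces the sum with a product, so any running total has to start from 1 rather than 0. A minimal sketch, assuming Python 3.8 or newer for `math.prod`; the variable names are introduced here for illustration:

```python
import math

outputs = [
    [0.2, 0.6, 0.1],  # network A
    [0.1, 0.5, 0.4],  # network B
    [0.6, 0.3, 0.1],  # network C
]

# multiply the votes for each class across all networks
totals = [math.prod(column) for column in zip(*outputs)]
winner = totals.index(max(totals))
print(totals, winner)
```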
        

In [None]:

def myEnseblePredict(estimators, data, voting="Fuzzy voting"):
    """Modular Neural Network Voting
          zrodlo: Voting  Schemes For Cooperative  Neural  Network  Classifiers
          https://pdfs.semanticscholar.org/721c/f173c00fff7af5edca9bd8976407591f647c.pdf
          
    Keyword arguments:
    estimators -- lista par (Klasyfikator, kolumny) 
    data -- dane do predykcji/klasyfikacji
    voting -- rodzaj głowsowania
    
    Returns:
    
    result -- klasyfikacja
    """
    
    def prod(iterable):
        """iloczyn elementów listy"""
        return reduce(operator.mul, iterable, 1)
    
    def plurality(row):
        """wyostrzanie listy [0.2, 0.6, 0.2] -> [0, 1, 0]"""
        maxIdx = np.argmax(row)
        result = [0 for x in row]
        result[maxIdx] = 1
        return result

    def borda(row):
        """zliczanie głosów typu border [0.1, 0.6, 0.3] -> [0, 2, 1]"""
        count = len(row)
        row_copy = row
        maxIdx = [0]*count
        result = [0 for x in row]
        for  i in range(count-1):
            maxIdx[i] = np.argmax(row_copy)
            row_copy[maxIdx[i]] = 0
        for i in range(count-1):
            result[maxIdx[i]] = count - 1 - i
        return result

    votes = [[0,0,0] for x in data.index]
    for el in estimators:
        estimator = el[0]
        columns = el[1]
        dane = data.loc[:, data.columns.isin(columns)]
        if len(el) == 3:
            # print("Typ do wektoryzacji:", type(data))
            dane = el[2].transform(dane.T.squeeze())
            # print("Dane po wektoryzacji:" , dane)

        temp = estimator.predict_proba(dane)
        # print(temp)
        if voting == "plurality":
            temp = [plurality(x) for x in temp]
            # print(temp)
        if voting == "Borda _count_voting":
            temp = [borda(x) for x in temp]
        if voting == "Fuzzy voting":
            pass

        if voting == "Nash voting":
            votes = [list(map(prod, zip(*t))) for t in zip(votes, temp)]
        else:
            votes = [list(map(sum, zip(*t))) for t in zip(votes, temp)]

    # print("Final votes:" , votes)
    result = [np.argmax(vote) for vote in votes]
    return result
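
To see how the four voting schemes can disagree, here is a minimal standalone sketch for a single sample and three classifiers. It does not call `myEnseblePredict`; the helper `combine`, its short scheme names, and the probabilities are illustrative assumptions, not results from this notebook.

```python
import numpy as np

# Hypothetical predict_proba outputs of three classifiers for ONE sample
# over three classes (made-up numbers).
probas = np.array([
    [0.50, 0.30, 0.20],
    [0.45, 0.35, 0.20],
    [0.25, 0.65, 0.10],
])

def combine(probas, voting):
    """Return the index of the winning class under the given scheme."""
    if voting == "plurality":
        # Each classifier casts a single vote for its most probable class.
        votes = np.zeros_like(probas)
        votes[np.arange(len(probas)), probas.argmax(axis=1)] = 1
        return int(votes.sum(axis=0).argmax())
    if voting == "borda":
        # Rank per classifier: 0 for the least probable class, n-1 for the top one.
        ranks = probas.argsort(axis=1).argsort(axis=1)
        return int(ranks.sum(axis=0).argmax())
    if voting == "fuzzy":
        # Sum the raw probabilities.
        return int(probas.sum(axis=0).argmax())
    if voting == "nash":
        # Multiply the raw probabilities.
        return int(probas.prod(axis=0).argmax())
    raise ValueError(voting)

for scheme in ("plurality", "borda", "fuzzy", "nash"):
    print(scheme, combine(probas, scheme))
```

With these numbers, plurality and Borda pick class 0 (two of three classifiers rank it first), while fuzzy and Nash voting pick class 1, because the third classifier is very confident about it.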


We split the data into training and test sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data, data['gender_label'], test_size=test_percentage, random_state=20)

In [None]:





# Each entry is (classifier, columns) or (classifier, columns, fitted vectorizer).
all_estimators = [
    (color_classifier, ['rl', 'gl', 'bl', 'rs', 'gs', 'bs']),
    (other_classifier, ["_golden", "_trusted_judgments", "fav_number", "retweet_count", "tweet_count", "profile_yn:confidence"]),
    (tweet_classifier, ["Tweets"], tweets_vect),
    (description_classifier, ["Description"], description_vect),
]

plurality_votes = myEnseblePredict(estimators=all_estimators, data=X_test, voting='plurality')
print("Plurality voting: ", metrics.accuracy_score(y_test, plurality_votes))

borda_votes = myEnseblePredict(estimators=all_estimators, data=X_test, voting='Borda _count_voting')
print("Borda count voting: ", metrics.accuracy_score(y_test, borda_votes))

fuzzy_votes = myEnseblePredict(estimators=all_estimators, data=X_test, voting='Fuzzy voting')
print("Fuzzy voting: ", metrics.accuracy_score(y_test, fuzzy_votes))

nash_votes = myEnseblePredict(estimators=all_estimators, data=X_test, voting='Nash voting')
print("Nash voting: ", metrics.accuracy_score(y_test, nash_votes))


print("Dropping Other_Features:")

estimators_without_other = [
    (color_classifier, ['rl', 'gl', 'bl', 'rs', 'gs', 'bs']),
    (tweet_classifier, ["Tweets"], tweets_vect),
    (description_classifier, ["Description"], description_vect),
]

plurality_votes = myEnseblePredict(estimators=estimators_without_other, data=X_test, voting='plurality')
print("Plurality voting: ", metrics.accuracy_score(y_test, plurality_votes))

borda_votes = myEnseblePredict(estimators=estimators_without_other, data=X_test, voting='Borda _count_voting')
print("Borda count voting: ", metrics.accuracy_score(y_test, borda_votes))

# This Nash run additionally leaves out tweet_classifier.
nash_votes = myEnseblePredict(estimators=[(color_classifier, ['rl', 'gl', 'bl', 'rs', 'gs', 'bs']), (description_classifier, ["Description"], description_vect)], data=X_test, voting='Nash voting')
print("Nash voting: ", metrics.accuracy_score(y_test, nash_votes))

fuzzy_votes = myEnseblePredict(estimators=estimators_without_other, data=X_test, voting='Fuzzy voting')
print("Fuzzy voting: ", metrics.accuracy_score(y_test, fuzzy_votes))



results = confusion_matrix(y_test, y_pred_class)
print('Confusion Matrix :')
print(results)
# print ('Accuracy Score :',accuracy_score(y_test, y_pred_class))
print('Report : ')
print(classification_report(y_test, y_pred_class))
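
Plain accuracy can hide weak performance on a smaller class. As a rough sketch, per-class recall and its unweighted mean (balanced accuracy) can be read directly off a confusion matrix. The matrix below is made up for illustration; in this notebook `results` would take its place.

```python
import numpy as np

# Made-up 3x3 confusion matrix in scikit-learn's layout:
# rows are true classes, columns are predicted classes.
cm = np.array([
    [80, 10, 10],
    [20, 20, 10],
    [10, 10, 30],
])

# Recall of a class = correctly predicted samples / all samples of that class.
recall_per_class = cm.diagonal() / cm.sum(axis=1)

# Balanced accuracy = unweighted mean of the per-class recalls.
balanced_accuracy = recall_per_class.mean()

# Plain accuracy = all correct predictions / all samples.
accuracy = cm.diagonal().sum() / cm.sum()

print('Recall per class:', recall_per_class)
print('Balanced accuracy:', balanced_accuracy)
print('Accuracy:', accuracy)
```

Here accuracy is 0.65 while balanced accuracy is 0.60, because the largest class is also the one classified best.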



null_accuracy = y_test.value_counts().head(1) / len(y_test)
print('Null accuracy:', null_accuracy)

# Manual calculation of null accuracy by always predicting the majority class
print('Manual null accuracy:',(1172 / (1172 + 1084 + 849)))
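
The manual calculation hard-codes the class counts, so it goes stale as soon as the split changes. A possible alternative, sketched here on stand-in labels, derives the same number from the label series itself. The series `y` and the mapping of counts to labels 0, 1, 2 are assumptions; in the notebook `y_test` would be used instead.

```python
import pandas as pd

# Stand-in labels reproducing the class counts quoted above (1172, 1084, 849).
# Which label carries which count is an assumption made for this example.
y = pd.Series([0] * 1172 + [1] * 1084 + [2] * 849)

# Share of the most frequent class = accuracy of always predicting that class.
null_accuracy = y.value_counts(normalize=True).max()
print('Null accuracy:', round(null_accuracy, 4))
```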




**What else could be done (where to go next)**

* A better measure of model accuracy

http://regulatorygenomics.upf.edu/courses/Master_AGB/2_ClassificationAlgorithms/Lecture_Accuracy.pdf

pages: 28-32

* Split the gender feature into two: human and sex.
One could then see how well the model separates a person from a company, and a woman from a man once the account is known to belong to a person.
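
A minimal sketch of how the two targets could be derived, assuming the raw label takes the values `'male'`, `'female'` and `'brand'`. These values and the toy series are assumptions and should be checked against the actual column.

```python
import pandas as pd

# Toy labels; the three values are assumed, not read from the dataset.
gender = pd.Series(['male', 'female', 'brand', 'female', 'brand'])

# Stage 1 target: is the account run by a person?
is_human = gender.isin(['male', 'female'])

# Stage 2 target: sex, defined only for the accounts that are human.
sex = gender[is_human]

print(is_human.tolist())
print(sex.tolist())
```

A first classifier would then be trained on `is_human` using all rows, and a second one on `sex` using only the human rows.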