## Topic modeling

### Data acquisition, description and preparation

For this project I downloaded all the Italian party manifestos from [WZB](https://manifesto-project.wzb.eu/). The data are available from 1963 and were collected up to 2001. After that year the format of the data changes, and it was not possible to integrate it in the code. The final dataset is made up of 79 different manifestos over 19 rounds of election. Unfortunately I was not able to collect the party names: the column that identifies the party contains a unique code.

The data were downloaded in .csv format. Each file contains three columns: one with the manifesto text and two that are irrelevant here (cmp_code and eu_code, mostly empty). The year and the party code are stored in the name of each .csv file. The function 'combine_csv' reads all the files in a given folder, combines them and extracts the information mentioned above.
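
The way the information is read from a file name can be sketched as follows. The name `32410_196304.csv` is a hypothetical example (the exact naming pattern of the downloaded files is an assumption); the only thing the code relies on is that the party code comes before the underscore and that the four characters after it are the year.

```python
# Hypothetical file name: '<party_code>_<date>.csv' (the exact pattern is an assumption)
filename = '32410_196304.csv'

stem = filename.replace('.csv', '')   # remove the extension
party_code, date = stem.split('_')    # split on the underscore
year = date[:4]                       # keep only the first four characters

print(party_code, year)
```

Both values are strings at this point, so the year would have to be converted with `int(year)` before any numeric comparison.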

In [1]:
#import
import pandas as pd
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from stop_words import get_stop_words 
from sklearn.decomposition import NMF
import matplotlib.pyplot as plt

In [2]:
def combine_csv(path):
    '''
    This function combines all the csv files in the given path and returns a dataframe
    
    Parameters:
    path: path of the folder containing the csv files
    
    Returns:
    df: dataframe containing all the csv files
    '''
    list_df = []
    for filename in os.listdir(path):
        if not filename.endswith('.csv'): #skip anything that is not a csv file
            continue
        df = pd.read_csv(os.path.join(path, filename))
        df.drop(columns = ['cmp_code', 'eu_code'], inplace = True) # drop the columns (mostly nans)
        filename = filename.replace('.csv', '')
        df['party_code'] = filename.split('_')[0] #extract the party code from the filename
        df['year'] = filename.split('_')[1][:4] #extract the year from the filename
        list_df.append(df)
    final_df = pd.concat(list_df)
    final_df.sort_values(['year'], inplace = True)
    final_df.reset_index(inplace=True, drop=True)
    return final_df

df = combine_csv('data_assignment_2')
df

| | text | party_code | year |
|---|---|---|---|
| 0 | Programma elettorale del P.R.I. Nelle polemic... | 32410 | 1963 |
| 1 | DEMOCRAZIA INDUSTRIALE. Oggi innestare il pr... | 32420 | 1963 |
| 2 | Il secolo. Il segretario nazionale del MSI ha ... | 32710 | 1963 |
| 3 | L’ORDINE DEL GIORNO APPROVATO D@ COMITATO CENT... | 32330 | 1963 |
| 4 | Segretario del PRI ha illustrato i due fondame... | 32410 | 1968 |
| ... | ... | ... | ... |
| 74 | PIANO DI GOVERNO PER UNA INTERA LEGISLATURA ... | 32611 | 2001 |
| 75 | Rinnoviamo l’Italia, insieme Il programma del... | 32220 | 2001 |
| 76 | PIANO DI GOVERNO PER UNA INTERA LEGISLATURA ... | 32610 | 2001 |
| 77 | PIANO DI GOVERNO PER UNA INTERA LEGISLATURA ... | 32710 | 2001 |


In [4]:
list_new_parties = check_new_party(df)
len(list_new_parties)

19

#### Creating the feature matrix

For this task I need to create two different feature matrices for each year: one for the newly entering parties and one for the already existing parties. To build the matrices I use the TF-IDF method. This technique is well suited to this kind of question because it lowers the weight of words that appear in most documents and raises the score of the less frequent ones, which helps the code spot the words associated with new topics.

An example of the code that will be used in the later functions is reported below:

In [9]:
list_manifestos = list(df.text) 
list_manifestos = list(map(lambda x: x.lower(), list_manifestos)) # convert everything to lower case
vectorizer = TfidfVectorizer(stop_words=get_stop_words('italian'))
dfm_manifestos = vectorizer.fit_transform(list_manifestos)
dfm_manifestos

<79x30931 sparse matrix of type '<class 'numpy.float64'>'
	with 201474 stored elements in Compressed Sparse Row format>

### Research question

From a political science perspective, new parties arise when the existing ones neglect some hot topic that matters to citizens. This theory is called [cleavage](http://www.u.arizona.edu/~mishler/LipsetRokkan.pdf) theory and was presented in 1967 by Lipset and Rokkan. It focused mainly on important historical conflicts, such as those between state and church or between landed and industrial interests.

It will be interesting to investigate whether the manifestos of the new parties match this theory. The idea is to find the most mentioned topics in the manifestos of the parties running for the first time and compare them with those of the parties already present.

The question to be answered is: do new parties talk about issues that were not addressed before?
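
One possible way to make this comparison measurable, which is not part of the analysis in this notebook, is to compute the share of a new party's top words that never appear among the top words of the existing parties. The sketch below is only an illustration: `share_of_new_words` is a hypothetical helper and the word lists are invented.

```python
def share_of_new_words(new_words, old_words):
    '''Fraction of the new party's top words that are absent from the old parties' top words.'''
    new_words, old_words = set(new_words), set(old_words)
    if not new_words:
        return 0.0
    return len(new_words - old_words) / len(new_words)

# Invented word lists, for illustration only
old_top_words = ['lavoro', 'stato', 'economia', 'scuola']
new_top_words = ['giustizia', 'televisione', 'lavoro', 'stato']

print(share_of_new_words(new_top_words, old_top_words))
```

A value close to 1 would suggest that the new party talks about issues the others ignore, while a value close to 0 would suggest that it repeats the existing agenda.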

## Topic modeling: development and description

In [14]:
def check_new_party(df):
    '''
    This function checks if there is a new party in the dataframe
    
    Parameters:
    df: dataframe containing the data
    
    Returns:
    new_party: list of tuples (new parties, year of their first appearance)
    '''
    first_year_df = df[df['year'] == df['year'].min()] #list of parties already present in the first year
    list_party = list(first_year_df['party_code'])
    new_party_year = []
    for index, row in df.iterrows():
        if row['party_code'] not in list_party: #check if the party is already present in the list
            list_party.append(row['party_code']) #if not, add it to the party list
            new_party_year.append((row['party_code'], row['year'])) #add the party and the year to the output list
    return new_party_year

def divide_data_for_year(df, list_new_parties):
    '''
    This function divides the dataframe in two dataframes: one containing the old parties and one containing the new parties per year

    Parameters:
    df: dataframe containing the data
    list_new_parties: list of tuples (new parties, year of their first appearance)

    Returns:
    list_dfs_per_year: list of tuples (df_old_parties, df_new_parties)
    '''
    list_dfs_per_year = []
    for year in df.year.unique():
        df_year = df[df['year'] == year] #filter the dataframe for the year
        #codes of the parties that appear for the first time in this year
        new_codes = [code for code, year_new_parties in list_new_parties if year_new_parties == year]
        is_new = df_year['party_code'].isin(new_codes)
        #old and new parties are rebuilt for every year, so no new party is carried over from a previous year
        list_dfs_per_year.append((df_year[~is_new], df_year[is_new]))
    return list_dfs_per_year

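def plot_top_words(model, feature_names, n_top_words, title):
    '''
    This function plots the top words of each topic and saves the figure in the folder 'plot'

    Parameters:
    model: fitted NMF model
    feature_names: vocabulary of the vectorizer
    n_top_words: number of words shown for each topic
    title: title of the figure, also used to build the file name

    Returns:
    t_titles: list with the three top words of each topic
    '''
    n_components = model.n_components
    n_rows = -(-n_components // 5) #ceiling division: enough rows of 5 axes to hold every topic
    fig, axes = plt.subplots(n_rows, 5, figsize=(30, 15), sharex=True, squeeze=False)
    axes = axes.flatten()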
    t_titles = []
    for topic_idx, topic in enumerate(model.components_):
        top_features_ind = topic.argsort()[: -n_top_words - 1 : -1]
        top_features = [feature_names[i] for i in top_features_ind]
        weights = topic[top_features_ind]

        ax = axes[topic_idx]
        ax.barh(top_features, weights, height=0.7)
        ax.set_title(f"Topic {topic_idx +1}", fontdict={"fontsize": 30})
        ax.invert_yaxis()
        ax.tick_params(axis="both", which="major", labelsize=20)
        for i in "top right left".split():
            ax.spines[i].set_visible(False)
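        t_titles.append(", ".join(top_features[:3]))

    fig.suptitle(title, fontsize=40) #set the figure title once, outside the loop
    plt.subplots_adjust(top=0.90, bottom=0.05, wspace=0.90, hspace=0.3)
    title = title.lower().replace(" ", '')
    os.makedirs('plot', exist_ok=True) #make sure the output folder exists
    plt.savefig('plot/' + title + '.png')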
    plt.show() 
    
    return t_titles   

def creating_dfms_and_plot_topics(list_dfs_per_year, N_topics = 5, words_for_topic = 10):
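    ''' 
    This function creates the dfms for the old parties and the new parties per year, fits an NMF model on each of them and plots the topics
    
    Parameters:
    list_dfs_per_year: list of tuples (df_old_parties, df_new_parties)
    N_topics: number of topics extracted by NMF
    words_for_topic: number of top words plotted for each topic
    
    Returns:
    list_dfms_per_year: list of tuples (dfm_old_parties, dfm_new_parties)
    '''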
    list_dfms_per_year = []
    for df_old_parties, df_new_parties in list_dfs_per_year:
        year = df_old_parties['year'].unique()[0]
        if df_new_parties.empty:
            continue 

        list_old_manifestos = list(df_old_parties['text']) #list of old parties manifestos
        list_old_manifestos = list(map(lambda x: x.lower(), list_old_manifestos)) # convert everything to lower case
        vectorizer = TfidfVectorizer(stop_words=get_stop_words('italian'))
        dfm_old_manifestos = vectorizer.fit_transform(list_old_manifestos)
        nmf = NMF(N_topics)
        nmf.fit(dfm_old_manifestos) #only the fitted components are needed for the plot
        t_titles = plot_top_words(nmf, vectorizer.get_feature_names_out(), words_for_topic, "Topics in existing parties in " + str(year))

        print('new parties ' + str(year))
        list_new_manifestos = list(df_new_parties['text']) #list of new parties manifestos
        list_new_manifestos = list(map(lambda x: x.lower(), list_new_manifestos)) # convert everything to lower case
        vectorizer = TfidfVectorizer(stop_words=get_stop_words('italian'))
        dfm_new_manifestos = vectorizer.fit_transform(list_new_manifestos)
        nmf = NMF(N_topics)
        nmf.fit(dfm_new_manifestos) #only the fitted components are needed for the plot
        t_titles = plot_top_words(nmf, vectorizer.get_feature_names_out(), words_for_topic, "Topics in new parties in " + str(year))
        print(t_titles)
        list_dfms_per_year.append((dfm_old_manifestos, dfm_new_manifestos))
        
    return list_dfms_per_year

In [13]:
%%capture
# call the function, now the plots are saved in the folder 'plot'

list_dfs_per_year = divide_data_for_year(df, list_new_parties)
list_dfms_per_year = creating_dfms_and_plot_topics(list_dfs_per_year,5,10)

The model used is NMF, which relies on non-negative matrix factorization. It is a powerful mathematical tool that is used in topic modeling to find the main contents of a text. NMF is one of the basic models, but it is very effective. It usually works with a TF-IDF feature matrix, as in the code above. The important parameters that need to be chosen are: 
* the number of topics the model extracts (N_topics)
* the number of top words shown for each topic (words_for_topic)
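
The role of the two parameters can be illustrated on a small matrix. This is a minimal sketch, assuming a random non-negative matrix in place of a real TF-IDF matrix; the names `n_topics` and `words_per_topic` are chosen for the example, and `init='nndsvd'` is set only to make the result reproducible.

```python
import numpy as np
from sklearn.decomposition import NMF

# Random non-negative matrix standing in for a TF-IDF matrix:
# 6 documents (rows) x 8 terms (columns), for illustration only
rng = np.random.default_rng(0)
X = rng.random((6, 8))

n_topics = 2
nmf = NMF(n_components=n_topics, init='nndsvd', max_iter=500)
W = nmf.fit_transform(X)   # document-topic weights, one row per document
H = nmf.components_        # topic-term weights, one row per topic

print(W.shape, H.shape)

# Indices of the highest-weighted terms of each topic,
# using the same slicing as in plot_top_words
words_per_topic = 3
for topic_idx, topic in enumerate(H):
    print(topic_idx, topic.argsort()[: -words_per_topic - 1 : -1])
```

The number of topics changes the factorization itself (the number of columns of `W` and rows of `H`), whereas the number of words per topic only changes how many entries of each row of `H` are displayed.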

I set the two numbers to 5 and 10 respectively; I reckon these values work fine for the chosen task. With different values the model may produce different plots that are not really comparable. I believe that increasing N_topics in particular may cause problems, because the extra topics would not be recognised as new. In other words, if the number of topics grows too large, the model will hardly spot a new topic in the new parties' manifestos, since the same topic might also emerge among the old ones even if it was barely mentioned there.
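
A rough way to explore the effect of the number of topics, not used in this notebook, is to fit the model for several values and look at the reconstruction error that scikit-learn stores in `reconstruction_err_`. The sketch below assumes a random non-negative matrix in place of the real TF-IDF matrix, and the candidate values 2, 5 and 8 are arbitrary.

```python
import numpy as np
from sklearn.decomposition import NMF

# Random non-negative matrix standing in for a TF-IDF matrix (illustration only)
rng = np.random.default_rng(0)
X = rng.random((10, 20))

errors = {}
for n_topics in (2, 5, 8):
    nmf = NMF(n_components=n_topics, init='nndsvd', max_iter=1000)
    nmf.fit(X)
    # Frobenius norm of the difference between X and its reconstruction W @ H
    errors[n_topics] = nmf.reconstruction_err_
    print(n_topics, round(errors[n_topics], 3))
```

A lower error only says that the matrix is approximated more closely; it does not guarantee that the topics are easier to interpret, so the plots still have to be inspected by eye.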

## Answering the research question

The plots for the 1972 election are shown below:

![Alt text](plot/topicsinexistingpartiesin1972.png)![Alt text](plot/topicsinnewpartiesin1972.png)

It is clear that the main concerns of the new party (DC, Christian Democracy) differ a lot from the issues tackled previously. DC focused much more on religion and pushed a democratic appeal that none of the earlier parties used in their manifestos.

Regarding the 1983 elections:

![Alt text](plot/topicsinexistingpartiesin1983.png)![Alt text](plot/topicsinnewpartiesin1983.png)

Here too it is possible to see a change of topic with the newly entering party. This time the new topics are aligned with left-wing ideology: the manifesto refers many times to social and workers' (lavoratori) issues.

But probably the most interesting year to evaluate is 1994. At the time Italy was going through a nationwide judicial investigation into political corruption called 'Mani pulite'. In its aftermath many parties disappeared, and it marked the beginning of the political career of Silvio Berlusconi, who is still part of the current government. He ran in the election presenting himself as the only alternative that could break with the previous political class. Before that he was in the media business; knowing this makes it easier to analyse the topics treated by his new party.

![Alt text](plot/topicsinexistingpartiesin1994.png)![Alt text](plot/topicsinnewpartiesin1994.png)

As expected, one of the main topics of Berlusconi's party is justice, which is not treated in any of the manifestos of the pre-existing parties. Another focus of the manifesto is the media sector, since it was his previous business and he used it heavily for propaganda during his candidacy.

To conclude, the data seem to support the theory. Further research is necessary to establish a complete inference.