# Install and load libraries

In [2]:
import pandas as pd
import seaborn as sns
import numpy as np
import ast
pd.set_option('display.max_rows',None)
pd.set_option('display.max_columns',None)
pd.set_option('max_colwidth',100)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all" 


from ipywidgets import widgets, interact, interactive, fixed
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
import sys
import copy
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.cluster import KMeans

# helper functions from the project's local lib package
from lib.functions import lemmatized_sentence,plot_wordcloud,generate_topics

import nltk
# nltk.download('all')


# Introduction

A textual study of philosophical works is necessary. Philosophical language is very abstract, but extracting words from the text and then observing and studying them can lead to many interesting conclusions. The philosophical issues studied by different schools, and the philosophical fields studied by different philosophers, may have commonalities as well as differences. Data analysis of philosophical text is the focus of this article.

# Data Preprocessing

Import the data from Kaggle, History of Philosophy (https://www.kaggle.com/kouroshalizadeh/history-of-philosophy), and take a look at the dimensions and structure of the dataset.

In [None]:
df = pd.read_csv('/Users/xiayiming/Desktop/philosophy_data.csv',encoding="UTF-8")
df.info()
df['title'].nunique()
df['author'].nunique()
df['school'].nunique()

As we can see, the dataset contains 360808 rows and 11 columns. The variables $original\_publication\_date$, $corpus\_edition\_date$ and $sentence\_length$ are integers, while the remaining variables have dtype object. No null values are detected. Moreover, the dataset contains 59 different books written by 36 authors from 13 distinct schools.
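
The same facts that `df.info()` reports can also be checked explicitly. The following is a minimal sketch on a tiny made-up frame (the titles, authors and schools in `demo` are placeholders, not rows from the real dataset), showing how the dimensions, null counts and unique counts could be read off directly:

```python
import pandas as pd

# Tiny made-up frame standing in for the real dataset
demo = pd.DataFrame({
    "title": ["Book A", "Book A", "Book B"],
    "author": ["Author X", "Author X", "Author Y"],
    "school": ["school_1", "school_1", "school_2"],
    "sentence_length": [42, 17, 88],
})

n_rows, n_cols = demo.shape                 # dimensions of the frame
null_counts = demo.isna().sum()             # missing values per column
unique_counts = demo[["title", "author", "school"]].nunique()

print(n_rows, n_cols)
print(null_counts.to_dict())
print(unique_counts.to_dict())
```

Applied to `df`, the same three expressions would give the row and column counts, the per-column null counts and the number of distinct titles, authors and schools in one place.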

## Text Processing - NLP

More information can be extracted from the variable $tokenized\_txt$ after NLP processing, which eliminates stop words and lemmatizes the sentences using the function 'lemmatized_sentence'. The lemmatized sentences are stored in the variable $lemmatized\_str$, and the lengths of those sentences (in words) are stored in the variable $lemmatized\_str\_len$.

In [None]:
df["lemmatized_str"] = df["tokenized_txt"].apply(
    lambda x: lemmatized_sentence(ast.literal_eval(x))
)
df["lemmatized_str_len"] = df["lemmatized_str"].apply(
    lambda x: len(x.split(" "))
)
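
'lemmatized_sentence' lives in the project's `lib.functions` module and its body is not shown in this notebook. As a rough sketch of what such a helper might do, the example below assumes that $tokenized\_txt$ stores each token list as a string (which is why `ast.literal_eval` is needed). The stop-word set and the lemma table are tiny hypothetical stand-ins; a real helper would more likely rely on NLTK's stop-word list and WordNet lemmatizer.

```python
import ast

# Hypothetical stand-ins: a real helper would likely use NLTK's stop-word
# list and WordNetLemmatizer instead of these tiny hand-written tables.
STOP_WORDS = {"the", "is", "of", "a", "and", "to", "are"}
LEMMA_TABLE = {"ideas": "idea", "forms": "form", "things": "thing"}


def lemmatized_sentence_sketch(tokens):
    """Drop stop words, map the remaining tokens to lemmas, join with spaces."""
    kept = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(LEMMA_TABLE.get(t, t) for t in kept)


# tokenized_txt is assumed to hold the string form of a Python list
raw = "['the', 'forms', 'are', 'ideas', 'of', 'things']"
tokens = ast.literal_eval(raw)          # back to a real list of strings
lemmatized = lemmatized_sentence_sketch(tokens)
lemmatized_len = len(lemmatized.split(" "))
print(lemmatized, lemmatized_len)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for turning the stored strings back into lists.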

# EDA

To begin the exploration, take a brief look at the data. For exploratory data analysis I will only keep the variables $title$, $author$, $school$, $original\_publication\_date$, $sentence\_length$, $sentence\_lowered$ and $tokenized\_txt$. The variable $lemmatized\_str$ is also kept, because the text mining section relies on it.

In [None]:
df=df[['title','author','school','original_publication_date','sentence_length','sentence_lowered','tokenized_txt','lemmatized_str']]

## Part 1
Some questions are listed below to get a better overview of the dataset.

1. Which school has the most titles in the dataset?
2. Which author is the most productive (with the most titles and sentences)?
3. Which title (book) has the most sentences in this dataset? Is the number of sentences per title normally distributed?
4. How many sentences are there per school? Is the number of titles per school positively correlated with the number of sentences per school?
5. What is the average sentence length per title? Is it correlated with the number of sentences per title?

## Part 2
Use plots and statistical analysis to answer the Part 1 questions and draw conclusions.

### Plots and tables

In [None]:
#problem1
# Change seaborn plot size
# fig = plt.gcf()
# fig.set_size_inches(8, 15)

df.groupby('school')['title'].nunique().sort_values(ascending=False).plot.barh()

In [None]:
#problem2
df.groupby('author')['title'].nunique().sort_values(ascending=False).head()
df.groupby('author')['title'].count().sort_values(ascending=False).head()

In [None]:
#problem3
df_1=df.groupby('title')['title'].count().sort_values(ascending=False).to_frame(name='count').reset_index()
df_1.head()
plt.hist(df_1['count'])

In [None]:
#problem4
df_2=df.groupby('school')['title'].count().to_frame(name='n_sentence').reset_index()
a_1=df.groupby('school')['title'].nunique()
df_2['n_title']=a_1.tolist()

df_2.head()
plt.scatter(df_2['n_sentence'],df_2['n_title'])
np.corrcoef(df_2['n_sentence'],df_2['n_title'])

In [None]:
#problem5
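df_3=df.groupby(['title'])['title'].count().to_frame(name='n_sentence').reset_index()
# average only the numeric sentence_length column; groupby sorts by title, so the order matches df_3
df_4=df.groupby('title')['sentence_length'].mean()
df_3['mean_sentence_length']=df_4.tolist()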

df_3.head()
plt.scatter(df_3['n_sentence'],df_3['mean_sentence_length'])
np.corrcoef(df_3['n_sentence'],df_3['mean_sentence_length'])

### Observations

Problem1:
As shown above, the Analytic school has the most works in the dataset. Analytic philosophy began around the turn of the 20th century, in the contemporary era, which may suggest that philosophical research focused heavily on it during that period. Communism, Capitalism, Feminism and Empiricism have roughly equal numbers of works, with no apparent difference.

Problem2:
I list the top 5 authors to analyze Problem2. Nietzsche has 5 titles in the dataset, while Aristotle has 48779 sentences. Hegel and Foucault both appear in the two tables, with the same number of titles and a relatively large number of sentences.

Problem3:
Aristotle - Complete Works has the most sentences, and Plato - Complete Works ranks second. The number of sentences per title is not normally distributed.

Problem4:
There is no apparent correlation between the number of titles per school and the number of sentences per school.

Problem5:
There is no apparent correlation between the average sentence length per title and the number of sentences per title.
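
For Problem4 and Problem5 the judgement rests on the Pearson correlation coefficient. `np.corrcoef` returns a 2x2 correlation matrix, and the coefficient of interest is its off-diagonal entry. The sketch below computes the coefficient both from the definition and with NumPy; the numbers are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Made-up numbers for illustration only, not taken from the dataset
n_sentence = np.array([1200.0, 5400.0, 800.0, 20000.0, 3100.0])
n_title = np.array([2.0, 5.0, 1.0, 3.0, 4.0])

# Pearson r: co-deviation divided by the product of the deviation norms
dx = n_sentence - n_sentence.mean()
dy = n_title - n_title.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r
r_matrix = np.corrcoef(n_sentence, n_title)
r_numpy = r_matrix[0, 1]

print(round(r_manual, 3), round(r_numpy, 3))
```

With only 13 schools, a coefficient for Problem4 rests on very few points, so it is best read together with the scatter plot rather than on its own.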

## Part 3
### Timeline figure and more insights

In [None]:
temp=df.groupby(by=['original_publication_date','school'])['title'].nunique().to_frame(name='count').reset_index()

#visualization of 'temp'
fig = plt.gcf()
# Change seaborn plot size
fig.set_size_inches(24, 8)

sns.barplot(y='count',x='original_publication_date',hue='school',data=temp)
ticks=plt.xticks(rotation=70)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0)


This timeline shows the number of works, classified by school, at different points in time.

It starts around 350 B.C. and runs to the late 20th century. The data does not contain many works from the medieval period. From the Renaissance of the 15th and 16th centuries, which heralded the beginning of the modern period, many more schools appear, as shown by the variety of colors in the figure.

As we can also observe, 3 books by Nietzsche were published in 1888, so Nietzsche was productive in that specific year. From the colors that appear continuously along the timeline, we can also notice that some schools were popular for a sustained period of time. For instance, from 1781 to 1820, German\_idealism published books continuously. Similarly, Analytic shows its first work in 1910 and keeps appearing from time to time, up to 1985.

# Text Mining

## Part 1 Wordcloud

### Question: What are the hot topics per school? Do they share similar thoughts or not?

### An Overall Wordcloud Inspection

In [None]:
schools = df["school"].unique()
for school in schools:
    plot_wordcloud(
        school,
        50
    )

### Findings

From the wordclouds, some schools that might seem abstract to us can be explored more specifically through the words shown. For example, the word 'socrates' appears more often in Plato, since Plato was the famous student of Socrates. Aristotle focuses more on definition, unity and, moreover, animals. Analytic discusses dream and psychology more.

The focus of some schools can be predicted from the definition of the school itself. For instance, many of the themes of Feminism concern the power status of women in education, reading and work. These are still very popular debate topics at present. Stoicism unsurprisingly contains debates about power, words and desire. Capitalism talks a lot about labour, nation and produce, while Communism pays attention to society, state and commodity. Empiricism devotes more attention to people, with research on paper work for truth. Rationalism focuses most on thoughts.
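
'plot_wordcloud' is imported from `lib.functions` and its body is not shown here. Word clouds are typically sized by term frequency, so a plain frequency count gives a rough numeric view of the same information. The sketch below uses made-up sentences, and the function name `top_words` is hypothetical:

```python
from collections import Counter


def top_words(sentences, max_words):
    """Return the max_words most frequent tokens across the given sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    return counts.most_common(max_words)


# Made-up lemmatized sentences, for illustration only
sample = [
    "labour produce value",
    "labour nation wealth",
    "value labour price",
]
print(top_words(sample, 2))
```

A table like this can back up the visual impression from a wordcloud, where differences in font size are hard to compare precisely.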

### Compute TF-IDF Weighted Document-Term

In [None]:
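from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
for school in schools:
    test=df.loc[df["school"] == school]
    # every lemmatized token is passed to the vectorizer as its own document
    t=nltk.word_tokenize(test['lemmatized_str'].str.cat(sep=' '))
    result = tfidf.fit_transform(t)
    # get_feature_names() was removed in scikit-learn 1.2
    df_5=pd.DataFrame({'word_name':tfidf.get_feature_names_out(), 'idf':tfidf.idf_})
    df_5.tail()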
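
To make the `idf` column easier to interpret, the sketch below reimplements the smoothed inverse document frequency that scikit-learn documents as the default for `TfidfVectorizer` (`smooth_idf=True`), namely idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The function name `smoothed_idf` and the sample documents are hypothetical.

```python
import math


def smoothed_idf(documents):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1 over a list of token lists."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        for term in set(tokens):        # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }


# Made-up documents, for illustration only
docs = [["soul", "body"], ["soul", "mind"], ["soul", "reason"]]
idf = smoothed_idf(docs)
print(idf["soul"], idf["mind"])
```

A term that appears in every document gets the minimum value of 1, while rarer terms get larger values. In the cell above every token is treated as its own document, so a high idf there simply marks words that occur rarely within a school.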

### Interactive visualizations on Important Words in Schools

The cell below provides an interactive wordcloud; the maximum number of words in the wordcloud can be chosen as 20, 50, 100 or 150.

In [None]:
schools = df["school"].unique().tolist()

interact(plot_wordcloud,school=schools,maxword=[20,50,100,150])

## Part 2  Topic Modeling

Here I use the LDA algorithm to generate 10 topics per school from the variable $lemmatized\_str$. We can then guess what kinds of topics each school deals with.

In [None]:
for school in schools:
    generate_topics(school)

In [None]:
interact(generate_topics,school=schools)

## Part 3 Sentiment Analysis

### Question: Are there significant differences or categorical tendencies in the emotional biases of each school?
### Sentiment Bar Plot

In [None]:
analyzer = SentimentIntensityAnalyzer()
rows = []

for school in schools:
    test = df.loc[df["school"] == school].copy()
    test['compound'] = test['sentence_lowered'].apply(lambda x:analyzer.polarity_scores(x)['compound'])
    test['sent_classification'] = test['compound'].apply(lambda x: 'positive' if x>=0.05 else 'negative' if x <=-0.05 else 'neutral')
    # share of sentences in each sentiment class for this school
    test_new=pd.DataFrame({school:test.groupby('sent_classification')['compound'].count()/test.shape[0]})
    rows.append(test_new.T)

# DataFrame.append was removed in pandas 2.0; concatenate the per-school rows instead
test_df = pd.concat(rows)

    

In [None]:
test_df['school']=test_df.index
test_df.plot(x='school', kind='bar', stacked=True,
        title='Stacked Bar Graph by dataframe')

### Findings from the plot
From the stacked bar plot I conclude that Capitalism, Empiricism and Rationalism use mostly positive sentences, while Continental and Feminism have the largest proportions of negative sentences. Phenomenology has the highest percentage of neutral sentences compared to the other schools.
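
The proportions in the plot come from thresholding each sentence's VADER compound score at +0.05 and -0.05, the cutoffs used in the cell above. The sketch below isolates that rule in plain Python; the helper names `classify` and `label_shares` are hypothetical and the scores are made up.

```python
from collections import Counter


def classify(compound):
    """Map a VADER compound score to a label using the +/-0.05 cutoffs."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"


def label_shares(compounds):
    """Share of sentences per label, the quantity stacked in the bar plot."""
    labels = Counter(classify(c) for c in compounds)
    total = len(compounds)
    return {label: labels[label] / total
            for label in ("negative", "neutral", "positive")}


# Made-up compound scores, for illustration only
scores = [0.6, 0.05, -0.3, 0.0, 0.02, -0.05, 0.4, -0.01]
shares = label_shares(scores)
print(shares)
```

Scores exactly at the cutoffs count as positive or negative rather than neutral, so the neutral band is the open interval between -0.05 and 0.05.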

### Kmeans Clustering

In [None]:
X=test_df.drop(['school'], axis=1)

In [None]:
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)  # fixed seed so the cluster assignment is reproducible
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

In [None]:
test_df['school']
y_kmeans

In [None]:
plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=y_kmeans, s=50, cmap='viridis')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)

### Findings from Kmeans clustering

As we can observe from the cluster result and the scatter plot above, Continental, Stoicism, Nietzsche and Feminism fall in the same group, which can be inferred to be the negative class, while Empiricism, Rationalism and Capitalism form the positive class. The rest of the schools belong to the neutral class.

The kmeans clustering results are consistent with the stacked bar plot. Since the clusters are computed from the same sentiment proportions shown in that plot, this agreement supports the grouping of the schools' sentiments rather than independently confirming it.
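
The cluster ids returned by KMeans are arbitrary integers, so reading them as "negative", "neutral" or "positive" requires looking at the cluster centers. One possible convention, sketched below, is to name each cluster after the sentiment share that is largest in its center. The school names and proportions are made up and are not the notebook's results.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up (negative, neutral, positive) shares, not the notebook's results
names = ["school_a", "school_b", "school_c", "school_d", "school_e", "school_f"]
X = np.array([
    [0.45, 0.30, 0.25],
    [0.44, 0.31, 0.25],
    [0.20, 0.55, 0.25],
    [0.21, 0.54, 0.25],
    [0.20, 0.30, 0.50],
    [0.21, 0.29, 0.50],
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Cluster ids are arbitrary, so name each cluster by its dominant share
columns = ["negative", "neutral", "positive"]
cluster_name = {
    k: columns[int(np.argmax(center))]
    for k, center in enumerate(kmeans.cluster_centers_)
}
assignment = {
    name: cluster_name[int(label)]
    for name, label in zip(names, kmeans.labels_)
}
print(assignment)
```

On real data one sentiment class may dominate every center, in which case comparing the centers against each other is more informative than taking the largest share.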

# Conclusion