# IST 652 Final Project - A Pantheon Exploration

# Introduction

Recorded human history spans only about 5,000 years. Understanding the makeup of the Pantheon gives readers of any age a clearer view of humanity's global, scientific, and cultural development. The word *Pantheon* describes a group of particularly respected, famous, or important people. Over the course of human history, thousands of such individuals have made an impact on society. Since the information age began in the latter part of the 20th century, people all over the world have had access to the same biographies of these individuals through Wikipedia, in multiple languages. Information that was once scarce and hard to reach, shelved in remote libraries, is now conveniently available through this medium.

Credit goes to the Macro Connections group at the Massachusetts Institute of Technology Media Lab and their [Pantheon Project](https://www.kaggle.com/mit/pantheon-project). The group not only made a Pantheon index available to the general public but also created a popularity index. One "barrier to entry" to this content is the number of languages in which each article is available, and this number is a key measure of the popularity of individuals in the Pantheon. The project describes its two measures as follows: "The simpler of the two measures, which we denote as L, is the number of different Wikipedia language editions that have an article about a historical character. The more sophisticated measure, which we name the Historical Popularity Index (HPI) corrects L by adding information on the age of the historical character, the concentration of page views among different languages, the coefficient of variation in page views, and the number of page views in languages other than English." (Source: [https://www.kaggle.com/mit/pantheon-project](https://www.kaggle.com/mit/pantheon-project))

The significance of historical figures may be debatable, but this report applies an objective measure of popularity to better understand how biographies of the Pantheon are being consumed today. From Aristotle to Benjamin Franklin, Jesus Christ to Al Pacino, historical figures and their attributes are measured with web analytics to give the reader insight into the following:

Research questions:

* Which historic characters are the most popular?
* When did they live and where are they from?
* What factors could have generated their popularity?
* Are there any observable trends in the categorical data provided?
* What clusters and groupings are in the data? How do groups compare in popularity?

# Analysis

Key analysis methods used in this report:

* Data cleaning
* Sorting and subsetting the data
* Line and bar plots
* Multiple regression

## About the Data

In [31]:
import os
import pandas as pd
import numpy as np
import math
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sn

In [32]:
dataFileName = "data/database.csv"
# Load the CSV only if it exists; otherwise report where we looked for it
if os.path.isfile(dataFileName):
    dfDirtyData = pd.read_csv(dataFileName, sep=",", header=0)
else:
    print("File not found:", os.path.join(os.getcwd(), dataFileName))


### Loading Data


In [33]:
# Showing information about the dataframe
dfDirtyData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11341 entries, 0 to 11340
Data columns (total 17 columns):
article_id                     11341 non-null int64
full_name                      11341 non-null object
sex                            11341 non-null object
birth_year                     11341 non-null object
city                           11341 non-null object
state                          2172 non-null object
country                        11308 non-null object
continent                      11311 non-null object
latitude                       10294 non-null float64
longitude                      10294 non-null float64
occupation                     11341 non-null object
industry                       11341 non-null object
domain                         11341 non-null object
article_languages              11341 non-null int64
page_views                     11341 non-null int64
average_views                  11341 non-null int64
historical_popularity_index    11341 non-null 

In [34]:
# Creating a variable to hold the fields to display, since not all columns will fit
columnToDisplayDirtyData = ['full_name', 'sex', 'birth_year', 'country', 'occupation','historical_popularity_index']

In [35]:
# Showing a single record (row 1522)
dfDirtyData.loc[1522]

article_id                             23671899
full_name                                Elisha
sex                                        Male
birth_year                              Unknown
city                                      Other
state                                       NaN
country                                 Unknown
continent                               Unknown
latitude                                    NaN
longitude                                   NaN
occupation                     Religious Figure
industry                               Religion
domain                             Institutions
article_languages                            41
page_views                              1338790
average_views                             32653
historical_popularity_index             25.5087
Name: 1522, dtype: object

In [36]:
# Showing the first 5 rows of the data in the data frame
dfDirtyData.loc[:,columnToDisplayDirtyData].head(5)

Unnamed: 0,full_name,sex,birth_year,country,occupation,historical_popularity_index
0,Aristotle,Male,-384,Greece,Philosopher,31.9938
1,Plato,Male,-427,Greece,Philosopher,31.9888
2,Jesus Christ,Male,-4,Israel,Religious Figure,31.8981
3,Socrates,Male,-469,Greece,Philosopher,31.6521
4,Alexander the Great,Male,-356,Greece,Military Personnel,31.584


In [37]:
# Showing the last 5 rows of the data in the data frame
dfDirtyData.loc[:,columnToDisplayDirtyData].tail(5)

Unnamed: 0,full_name,sex,birth_year,country,occupation,historical_popularity_index
11336,Sean St Ledger,Male,1984,United Kingdom,Soccer Player,11.1346
11337,Saina Nehwal,Female,1990,India,Athlete,10.6122
11338,Rūta Meilutytė,Female,1997,Lithuania,Swimmer,10.3821
11339,Vladimír Weiss,Male,1989,Slovakia,Soccer Player,10.2495
11340,Missy Franklin,Female,1995,United States,Swimmer,9.8794


## Data Cleaning

In [38]:
# Keeping only the digit characters of a value, e.g. '1237?' -> '1237'.
# Returns an empty string when the value contains no digits at all.
def removeCharFromNumber(inputValue):
    return ''.join(char for char in str(inputValue) if char.isdigit())

In [39]:
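# Finding the row indexes of a column whose values cannot be read as numbers.
# Values with stray characters (e.g. '1237?') are repaired in place;
# values with no digits at all are returned so the caller can handle them.
def findStringValueIndex(fieldName):
    badRowIndex = [] # initialize list of problematic rows
    for idx, value in enumerate(dfCleanData[fieldName]):
        try:
            int(value)
        except (ValueError, TypeError):
            newValue = removeCharFromNumber(value)
            if newValue != '':
                # Use the function argument, not the global loop variable
                dfCleanData.loc[idx, fieldName] = newValue
            else:
                badRowIndex.append(idx)
    return badRowIndex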

In [40]:
# Creating a new copy of a dataframe
dfCleanData = dfDirtyData.copy()

In [41]:
# Renaming historical_popularity_index to popularity (short name)
dfCleanData.rename(columns={'historical_popularity_index': 'popularity'}, inplace=True)

In [42]:
columnToNumeric = ["article_languages", "birth_year", "latitude", "longitude", "page_views","average_views", "popularity"]
for col in columnToNumeric:
    # Rows whose value has no digits at all (e.g. 'Unknown' or NaN) are set to 0
    indx = findStringValueIndex(col)
    if len(indx) > 0:
        dfCleanData.loc[indx, col] = 0
# Converting the columns above to numeric data type
dfCleanData[columnToNumeric] = dfCleanData[columnToNumeric].apply(pd.to_numeric)
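
The cleaning loop above can also be written in a vectorized form. The following is a minimal sketch, not the notebook's method: the helper name `to_numeric_column` and the sample values are illustrative assumptions (the strings `'1237?'`, `'530s'`, and `'Unknown'` do appear in this dataset). Unlike `removeCharFromNumber`, which joins every digit in the value, this version keeps only the first numeric run and preserves a leading minus sign.

```python
import pandas as pd

def to_numeric_column(series: pd.Series, fill_value=0) -> pd.Series:
    """Hypothetical helper: pull the first number out of each value.

    Values without any digits (e.g. 'Unknown', missing) become fill_value.
    """
    # Optional minus sign, digits, optional decimal part
    extracted = series.astype(str).str.extract(r"(-?\d+(?:\.\d+)?)", expand=False)
    # Anything that did not match is NaN here and is replaced by fill_value
    return pd.to_numeric(extracted, errors="coerce").fillna(fill_value)

# Small illustrative sample, not the full birth_year column
sample = pd.Series(["-384", "1237?", "530s", "Unknown", None])
cleaned = to_numeric_column(sample)
print(cleaned.tolist())
```

Filling unknown values with 0 mirrors the behavior of the loop above; whether 0 is a sensible stand-in for an unknown birth year or coordinate is a separate modeling decision.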

In [43]:
# List of columns to be converted into numeric data type
#columnToConvert = ["birth_year"]
# Making sure to replace the non-numeric values to 0 before converting to numeric
# Examples: 
#           dfCleanData.loc[findStringValueIndex('latitude'), 'latitude'] = 0
# The following code block does something similar to the example, but
# by iterating through the list
# replaced incorrect birth year with an actual year
#dfCleanData['birth_year'] = dfCleanData['birth_year'].replace(['1237?'], 1237).replace(['530s'], 530)

In [44]:
# Looping through the columns in the dataframe.
# For columns with the 'object' data type, replacing NaN with "Unknown"
for column in dfCleanData.columns:
    if dfCleanData[column].dtype == 'object':
        # Assign the result back instead of using inplace=True on a column selection
        dfCleanData[column] = dfCleanData[column].fillna("Unknown")

In [45]:
# Showing information about the dataframe
dfCleanData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11341 entries, 0 to 11340
Data columns (total 17 columns):
article_id           11341 non-null int64
full_name            11341 non-null object
sex                  11341 non-null object
birth_year           11341 non-null int64
city                 11341 non-null object
state                11341 non-null object
country              11341 non-null object
continent            11341 non-null object
latitude             11341 non-null float64
longitude            11341 non-null float64
occupation           11341 non-null object
industry             11341 non-null object
domain               11341 non-null object
article_languages    11341 non-null int64
page_views           11341 non-null int64
average_views        11341 non-null int64
popularity           11341 non-null float64
dtypes: float64(3), int64(5), object(9)
memory usage: 1.5+ MB


In [46]:
dirtyBirthYear = dfDirtyData[dfDirtyData['birth_year'] == '1237?'].loc[:,'birth_year']
cleanBirthYear = dfCleanData[dfCleanData['birth_year'] == 1237].loc[:,'birth_year']
print('Dirty Birth Year', dirtyBirthYear)
print()
print('Clean Birth Year', cleanBirthYear)


Dirty Birth Year 3009    1237?
Name: birth_year, dtype: object

Clean Birth Year 3009    1237
Name: birth_year, dtype: int64


In [47]:
# Creating a variable to hold the fields to display, since not all columns will fit
columnToDisplayCleanData = ['full_name', 'sex', 'birth_year', 'country', 'industry','domain','occupation','popularity']

In [48]:
# Showing the first 5 rows of the data in the dataset
dfCleanData.loc[:,columnToDisplayCleanData].head(5)

Unnamed: 0,full_name,sex,birth_year,country,industry,domain,occupation,popularity
0,Aristotle,Male,-384,Greece,Philosophy,Humanities,Philosopher,31.9938
1,Plato,Male,-427,Greece,Philosophy,Humanities,Philosopher,31.9888
2,Jesus Christ,Male,-4,Israel,Religion,Institutions,Religious Figure,31.8981
3,Socrates,Male,-469,Greece,Philosophy,Humanities,Philosopher,31.6521
4,Alexander the Great,Male,-356,Greece,Military,Institutions,Military Personnel,31.584


In [49]:
# Showing the last 5 rows of the data in the dataset
dfCleanData.loc[:,columnToDisplayCleanData].tail(5)

Unnamed: 0,full_name,sex,birth_year,country,industry,domain,occupation,popularity
11336,Sean St Ledger,Male,1984,United Kingdom,Team Sports,Sports,Soccer Player,11.1346
11337,Saina Nehwal,Female,1990,India,Individual Sports,Sports,Athlete,10.6122
11338,Rūta Meilutytė,Female,1997,Lithuania,Individual Sports,Sports,Swimmer,10.3821
11339,Vladimír Weiss,Male,1989,Slovakia,Team Sports,Sports,Soccer Player,10.2495
11340,Missy Franklin,Female,1995,United States,Individual Sports,Sports,Swimmer,9.8794


In [50]:
#sn.heatmap()

## Exploration

In [51]:
# Counting records by industry and sex
dfTotalByContinentBySex = pd.DataFrame(dfCleanData.groupby(['industry','sex'])['article_id'].count())
dfTotalByContinentBySex.rename(columns={'article_id': 'total_count'}, inplace=True)
dfTotalByContinentBySex.sort_values(by = ['industry'], ascending = True).head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,total_count
industry,sex,Unnamed: 2_level_1
Activism,Female,38
Activism,Male,76
Business,Female,4
Business,Male,87
Companions,Female,94
Companions,Male,7
Computer Science,Female,1
Computer Science,Male,32
Dance,Female,7
Dance,Male,5
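
The count, wrap, and rename pattern used in this section can be collapsed into a single chain. This is a minimal sketch assuming a small hand-built frame, `people`, that copies the `industry` and `sex` values of the five top-ranked rows shown earlier; it is not the full dataset.

```python
import pandas as pd

# Five rows copied from the head of the cleaned data shown earlier
people = pd.DataFrame({
    "full_name": ["Aristotle", "Plato", "Jesus Christ", "Socrates", "Alexander the Great"],
    "industry": ["Philosophy", "Philosophy", "Religion", "Philosophy", "Military"],
    "sex": ["Male"] * 5,
})

# size() counts rows per group; reset_index(name=...) names the count column directly
counts = (
    people.groupby(["industry", "sex"])
    .size()
    .reset_index(name="total_count")
    .sort_values("total_count", ascending=False)
)
print(counts)
```

Because `size()` counts rows rather than non-null values of one column, it does not depend on `article_id` being fully populated.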


In [52]:
# Showing detailed records for the selected industry
industry = 'Companions'
dfCleanData[dfCleanData['industry'] == industry].loc[:,columnToDisplayCleanData].sort_values(by = ['birth_year'], ascending = False).head(10)

Unnamed: 0,full_name,sex,birth_year,country,industry,domain,occupation,popularity
9957,"Catherine, Duchess of Cambridge",Female,1982,United Kingdom,Companions,Public Figure,Companion,17.6612
10161,Kevin Federline,Male,1978,United States,Companions,Public Figure,Companion,17.1996
9647,"Mary, Crown Princess of Denmark",Female,1972,Australia,Companions,Public Figure,Companion,18.2924
8265,Vanessa Paradis,Female,1972,France,Companions,Public Figure,Companion,20.783
9771,Princess Máxima of the Netherlands,Female,1971,Argentina,Companions,Public Figure,Companion,18.035
9092,Queen Rania of Jordan,Female,1970,Kuwait,Companions,Public Figure,Companion,19.3986
9735,"Sophie, The Countess of Wessex",Female,1965,United Kingdom,Companions,Public Figure,Companion,18.111
8368,Michelle Obama,Female,1964,United States,Companions,Public Figure,Companion,20.633
6635,Laura Bush,Female,1946,United States,Companions,Public Figure,Companion,22.4461
5361,Priscilla Presley,Female,1945,United States,Companions,Public Figure,Companion,23.196


In [53]:
# Getting the count by industry and occupation
dfTotalByCountryByIndustry = pd.DataFrame(dfCleanData.groupby(['industry','occupation'])['article_id'].count())
# Renaming the default column name the dataframe created to a meaningful name
dfTotalByCountryByIndustry.rename(columns={'article_id': 'total_count'}, inplace=True)
dfTotalByCountryByIndustry.sort_values(by = ['total_count'], ascending = False).head(20)

Unnamed: 0_level_0,Unnamed: 1_level_0,total_count
industry,occupation,Unnamed: 2_level_1
Government,Politician,2530
Film And Theatre,Actor,1193
Team Sports,Soccer Player,1064
Language,Writer,954
Religion,Religious Figure,518
Music,Singer,437
Music,Musician,381
Philosophy,Philosopher,281
Natural Sciences,Physicist,268
Music,Composer,225


In [54]:
# Which historic characters are in the top ten most popular?
dfTopTenMostPopularPeople = dfCleanData.sort_values(by = ['popularity'], ascending = False)
dfTopTenMostPopularPeople = dfTopTenMostPopularPeople.head(10)
dfTopTenMostPopularPeople.loc[:,columnToDisplayCleanData]

Unnamed: 0,full_name,sex,birth_year,country,industry,domain,occupation,popularity
0,Aristotle,Male,-384,Greece,Philosophy,Humanities,Philosopher,31.9938
1,Plato,Male,-427,Greece,Philosophy,Humanities,Philosopher,31.9888
2,Jesus Christ,Male,-4,Israel,Religion,Institutions,Religious Figure,31.8981
3,Socrates,Male,-469,Greece,Philosophy,Humanities,Philosopher,31.6521
4,Alexander the Great,Male,-356,Greece,Military,Institutions,Military Personnel,31.584
5,Leonardo da Vinci,Male,1452,Italy,Invention,Science & Technology,Inventor,31.4644
6,Confucius,Male,-551,China,Philosophy,Humanities,Philosopher,31.3705
7,Julius Caesar,Male,-100,Italy,Government,Institutions,Politician,31.1161
8,Homer,Male,-800,Turkey,Language,Humanities,Writer,31.1087
9,Pythagoras,Male,-570,Greece,Philosophy,Humanities,Philosopher,31.0691


### Which historic characters are the most popular?
* Aristotle, Plato, and Jesus Christ

In [55]:
dfTotalTop10ByCountry = pd.DataFrame(dfTopTenMostPopularPeople.groupby(['country'])['article_id'].count())
# Renaming the count column; the index is already named 'country' by the groupby
dfTotalTop10ByCountry.rename(columns={'article_id': 'total_count'}, inplace=True)
dfTotalTop10ByCountry

Unnamed: 0_level_0,total_count
country,Unnamed: 1_level_1
China,1
Greece,5
Israel,1
Italy,2
Turkey,1


In [56]:
# Change to bar... use the dfCleanData
#plt.bar(dfTotalTop10ByCountry)
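
One way the planned bar chart could look is sketched below. It is an illustration under stated assumptions, not the notebook's final plot: the counts are typed in from the top-ten-by-country output above rather than read from `dfCleanData`, and the title text is made up for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Counts copied from the top-ten-by-country table above
top10_by_country = pd.Series(
    {"China": 1, "Greece": 5, "Israel": 1, "Italy": 2, "Turkey": 1},
    name="total_count",
).sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(6, 3))
# plt.bar needs x positions/labels and heights, not a whole DataFrame
bars = ax.bar(top10_by_country.index, top10_by_country.values)
ax.set_xlabel("country")
ax.set_ylabel("total_count")
ax.set_title("Top ten most popular figures by country")  # illustrative title
fig.tight_layout()
```

With the real data, the same call should work on `dfTotalTop10ByCountry.index` and `dfTotalTop10ByCountry['total_count']`.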

In [57]:
# Which historic characters are in the top ten least popular?
dfTopTenLeastPopularPeople = dfCleanData.sort_values(by = ['popularity'], ascending = True)
dfTopTenLeastPopularPeople = dfTopTenLeastPopularPeople.head(10)
dfTopTenLeastPopularPeople.loc[:,columnToDisplayCleanData].head(10)

Unnamed: 0,full_name,sex,birth_year,country,industry,domain,occupation,popularity
11340,Missy Franklin,Female,1995,United States,Individual Sports,Sports,Swimmer,9.8794
11339,Vladimír Weiss,Male,1989,Slovakia,Team Sports,Sports,Soccer Player,10.2495
11338,Rūta Meilutytė,Female,1997,Lithuania,Individual Sports,Sports,Swimmer,10.3821
11337,Saina Nehwal,Female,1990,India,Individual Sports,Sports,Athlete,10.6122
11336,Sean St Ledger,Male,1984,United Kingdom,Team Sports,Sports,Soccer Player,11.1346
11335,Jetro Willems,Male,1994,Netherlands,Team Sports,Sports,Soccer Player,11.3956
11334,Rebecca Soni,Female,1987,United States,Individual Sports,Sports,Swimmer,11.405
11333,Sun Yang,Male,1991,China,Individual Sports,Sports,Swimmer,11.6234
11332,Shane Long,Male,1987,Ireland,Team Sports,Sports,Soccer Player,11.7174
11331,Marc Albrighton,Male,1989,United Kingdom,Team Sports,Sports,Soccer Player,11.7258


In [58]:
dfTotalTop10LByContinentByCountry = pd.DataFrame(dfTopTenLeastPopularPeople.groupby([ 'country','sex'])['article_id'].count())
dfTotalTop10LByContinentByCountry.rename(columns={'article_id': 'total_count'}, inplace=True)
dfTotalTop10LByContinentByCountry

Unnamed: 0_level_0,Unnamed: 1_level_0,total_count
country,sex,Unnamed: 2_level_1
China,Male,1
India,Female,1
Ireland,Male,1
Lithuania,Female,1
Netherlands,Male,1
Slovakia,Male,1
United Kingdom,Male,2
United States,Female,2


In [59]:
# Getting the count by continent and by country
dfTotalByContinentByCountry = pd.DataFrame(dfCleanData.groupby(['continent', 'country'])['article_id'].count())
# Renaming the default column name the dataframe created to a meaningful name
dfTotalByContinentByCountry.rename(columns={'article_id': 'total_count'}, inplace=True)
dfTotalByContinentByCountry.sort_values(by = ['total_count'], ascending = False).head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,total_count
continent,country,Unnamed: 2_level_1
North America,United States,2168
Europe,United Kingdom,1145
Europe,France,866
Europe,Italy,808
Europe,Germany,747
Unknown,Unknown,438
Europe,Russia,374
Europe,Spain,296
Asia,Turkey,202
Europe,Poland,173


In [60]:
# Getting the count by continent
dfTotalByContinent = pd.DataFrame(dfCleanData.groupby(['continent'])['article_id'].count())
# Renaming the default column name the dataframe created to a meaningful name
dfTotalByContinent.rename(columns={'article_id': 'total_count'}, inplace=True)
dfTotalByContinent.sort_values(by = ['total_count'], ascending = False)

Unnamed: 0_level_0,total_count
continent,Unnamed: 1_level_1
Europe,6368
North America,2439
Asia,1188
Unknown,438
Africa,419
South America,366
Oceania,123


## Modeling
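
Multiple regression is listed among the analysis methods. As a rough sketch of the mechanics only, the example below fits an ordinary least-squares model with two predictors using NumPy. Everything in it is an assumption made for illustration: the data are synthetic and noise-free, the coefficient values are made up, and the choice of predictors (number of language editions and log page views) is hypothetical rather than taken from this report's results.

```python
import numpy as np

# Synthetic toy data, NOT drawn from the Pantheon dataset
rng = np.random.default_rng(0)
n = 50
article_languages = rng.integers(25, 200, size=n).astype(float)
log_page_views = rng.uniform(10, 18, size=n)

# Made-up coefficients: intercept, languages, log page views
true_coef = np.array([5.0, 0.02, 1.1])

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(n), article_languages, log_page_views])
y = X @ true_coef  # noise-free response, so the fit should recover true_coef

# Ordinary least squares: minimizes ||X @ coef - y||^2
coef, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print(coef)
```

On the real data, `X` would be built from columns of `dfCleanData` and `y` from `popularity`; the fitted coefficients would then need the usual diagnostics before being interpreted.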

# Results

# Conclusion