## Getting word counts through time with Pandas and Resample

![photo](https://images0.persgroep.net/rcs/vbjkJLgd1LNxTV9xxW6nnaoHceQ/diocontent/131247562/_crop/0/49/1580/893/_fitwidth/763?appId=93a17a8fd81db0de025c8abd1cca1279&quality=0.8)

***

### 1. Open and prepare the dataset 

***

In [22]:
import warnings
warnings.filterwarnings('ignore')  # only use this when you know the script and want to suppress unnecessary warnings

import pandas as pd
# Create the example DataFrame (named `data` so the built-in `dict` is not shadowed)
data = {'year': ['1950', '1951', '1952', '1953', '1954'],
        'text': ['Cees Aart Arie Jan Otto Gijs Sef Toon',
                 'Cees Aart Arie Jan Otto Gijs Sef Toon Cees Aart Arie Jan Otto Gijs Sef Toon',
                 'Aart Arie Toon',
                 'Jan Otto',
                 'Gijs']}
df = pd.DataFrame(data, index=['0', '1', '3', '4', '5'])
df

| | year | text |
|---|---|---|
| 0 | 1950 | Cees Aart Arie Jan Otto Gijs Sef Toon |
| 1 | 1951 | Cees Aart Arie Jan Otto Gijs Sef Toon Cees Aar... |
| 3 | 1952 | Aart Arie Toon |
| 4 | 1953 | Jan Otto |
| 5 | 1954 | Gijs |


In [23]:
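# Convert the strings in the column 'year' (e.g. '1950') to Pandas datetimes (1950-01-01).
# format='%Y' states explicitly that the value is a four-digit year; errors='coerce' turns unparseable values into NaT
df['datetime'] = pd.to_datetime(df['year'], format='%Y', errors='coerce')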

# Add a count (this will be useful later when making the tables)
df['count'] = 1
df

| | year | text | datetime | count |
|---|---|---|---|---|
| 0 | 1950 | Cees Aart Arie Jan Otto Gijs Sef Toon | 1950-01-01 | 1 |
| 1 | 1951 | Cees Aart Arie Jan Otto Gijs Sef Toon Cees Aar... | 1951-01-01 | 1 |
| 3 | 1952 | Aart Arie Toon | 1952-01-01 | 1 |
| 4 | 1953 | Jan Otto | 1953-01-01 | 1 |
| 5 | 1954 | Gijs | 1954-01-01 | 1 |
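
The effect of `errors='coerce'` in the conversion above is easiest to see on a value that cannot be parsed. The following is a minimal sketch, assuming a small made-up Series (`years` is not part of the dataset) and an explicit `format='%Y'`:

```python
import pandas as pd

# Hypothetical input: two valid years and one value that is not a year
years = pd.Series(['1950', '1951', 'unknown'])

# format='%Y' reads each string as a four-digit year (1 January of that year);
# errors='coerce' replaces anything that does not fit with NaT instead of raising
parsed = pd.to_datetime(years, format='%Y', errors='coerce')

print(parsed)
print('Unparseable rows:', parsed.isna().sum())
```

Rows that end up as NaT are silently left out of a later `resample`, so it can be worth checking `parsed.isna().sum()` before counting.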


***

### 2. Getting the word counts

***

In [24]:
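# Word counts for the 'term of interest' per year.
# You can resample by year, month, or day with 'A-DEC', 'M', or 'D'
# (recent pandas releases prefer 'YE-DEC' and 'ME' for the first two).
# str.count takes a regular expression: r'\bAart\w*' matches words that start with 'Aart'.
# Note that 'Aart*' is not a wildcard; as a regex it means 'Aar' followed by zero or more 't'.
df['term_of_interest'] = df['text'].str.count(r'\bAart\w*')
df_word = df.set_index('datetime').resample('A-DEC')['term_of_interest'].sum()
df_word = df_word.reset_index()
# Sum only the count column; the datetime column cannot be summed
print(df_word[['term_of_interest']].sum())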
df_word

term_of_interest    4
dtype: int64


| | datetime | term_of_interest |
|---|---|---|
| 0 | 1950-12-31 | 1 |
| 1 | 1951-12-31 | 2 |
| 2 | 1952-12-31 | 1 |
| 3 | 1953-12-31 | 0 |
| 4 | 1954-12-31 | 0 |
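
Because `str.count` interprets its argument as a regular expression, the choice of pattern changes what gets counted. The sketch below compares three patterns on a made-up Series (`text`, and the names "Aar" and "Aartje", are illustrative and not in the dataset above):

```python
import pandas as pd

# Hypothetical text with near-matches of the term 'Aart'
text = pd.Series(['Aart Aar Aartje', 'Jan Otto'])

# 'Aart*' is 'Aar' plus zero or more 't', so it also counts 'Aar'
loose = text.str.count('Aart*')

# Word boundaries on both sides: only the exact word 'Aart'
exact = text.str.count(r'\bAart\b')

# Words that start with 'Aart' (prefix match): 'Aart' and 'Aartje'
prefix = text.str.count(r'\bAart\w*')

print(pd.DataFrame({'text': text, 'loose': loose, 'exact': exact, 'prefix': prefix}))
```

On the example dataset all three patterns give the same totals, since the only match there is the exact word "Aart"; on real text they can diverge.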


In [25]:
# Optional: Graph them with Plotly in a bar chart
import plotly.express as px
fig = px.bar(df_word, x='datetime', y='term_of_interest')
fig.update_layout(showlegend=False,
    xaxis_rangeslider_visible=False,
    width=500,
    height=500)  
fig.update_layout(paper_bgcolor='rgba(0,0,0,0)', plot_bgcolor='rgba(0,0,0,0)')
fig.update_xaxes(title_text="Year", showgrid=True, gridwidth=0.3, gridcolor='LightGrey')
fig.update_yaxes(title_text="# References to term of interest", showgrid=True, gridwidth=0.3, gridcolor='LightGrey')
fig.show()
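
To compare several terms in one chart, the counts from section 2 can be computed per term and reshaped into a long table. This is a sketch under assumptions, not part of the original notebook: the list `terms` is hypothetical, and it groups by calendar year with `groupby` rather than `resample`, which avoids frequency aliases but, unlike `resample`, does not add rows for years that have no data.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'year': ['1950', '1951', '1952', '1953', '1954'],
    'text': ['Cees Aart Arie Jan Otto Gijs Sef Toon',
             'Cees Aart Arie Jan Otto Gijs Sef Toon Cees Aart Arie Jan Otto Gijs Sef Toon',
             'Aart Arie Toon',
             'Jan Otto',
             'Gijs'],
})
df['datetime'] = pd.to_datetime(df['year'], format='%Y')

# Hypothetical list of terms to track
terms = ['Aart', 'Gijs']

# One column of exact-word counts per term; re.escape guards against regex metacharacters
counts = pd.DataFrame({t: df['text'].str.count(rf'\b{re.escape(t)}\b') for t in terms})
counts['year'] = df['datetime'].dt.year

# Total per calendar year, then reshape to one row per (year, term)
per_year = counts.groupby('year')[terms].sum()
tidy = per_year.reset_index().melt(id_vars='year', var_name='term', value_name='count')

print(per_year)
```

The long table `tidy` has the columns `year`, `term`, and `count`, so it could be passed to `px.bar` with `x='year'`, `y='count'`, and `color='term'` to draw one bar series per term.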