# Data Exploration

In [23]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import ydata_profiling

In [2]:
# Load data
data_path = 'data/meeting notes.csv'
df = pd.read_csv(data_path)

In [3]:
df.head()

Unnamed: 0,ID,RAW_CONTENT,MEETING_NAME,TITLE,PUBLISH_DATE
0,m-1608357555-1682084460,中共中央总书记、国家主席、中央军委主席、中央全面深化改革委员会主任习近平4月21日下午主...,中央全面深化改革会议,会议审议通过了关于强化企业科技创新主体地位的意见关于加强和改进国有经济管理有力支持中国式现代...,2023-04-21T21:41:00Z
1,m-158629831-1682664300,中共中央政治局4月28日召开会议，分析研究当前经济形势和经济工作。中共中央总书记习近平主...,政治局会议,分析研究当前经济形势和经济工作,2023-04-28T14:45:00Z
2,m-704972677-1680162720,中共中央政治局3月30日召开会议，决定从今年4月开始，在全党自上而下分两批开展学习贯彻习...,政治局会议,决定从2023年4月开始在全党自上而下分两批开展学习贯彻习近平新时代中国特色社会主义思想主题...,2023-03-30T15:52:00Z
3,m-248496144-1366366560,中共中央政治局4月19日召开会议，决定从今年下半年开始，用一年左右时间，在全党自上而下分...,政治局会议,研究部署在全党深入开展党的群众路线教育实践活动,2013-04-19T18:16:00Z
4,m-1020277137-1469618520,李克强主持召开国务院常务会议\n听取关于地方和部门推进重大项目落地审计情况汇报 完善奖惩机制...,国务院常委会,李克强主持召开国务院常务会议通过十三五国家科技创新专项规划 以创新型国家建设引领和支撑升级发...,2016-07-27T19:22:00Z


In [4]:
df.shape

(531, 5)

In [5]:
# Count unique note IDs and unique titles
df['ID'].nunique(), df['TITLE'].nunique()

(531, 496)
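
The gap between 531 unique IDs and 496 unique titles means some titles are shared by more than one note. A minimal sketch of how those repeated titles could be listed is shown here. It runs on a small made-up frame (`toy`, with invented values) rather than on `df`, so the names and data are assumptions for illustration only.

```python
import pandas as pd

# Toy frame standing in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "ID": ["a", "b", "c", "d"],
    "TITLE": ["t1", "t2", "t1", "t3"],
})

# Count rows per title and keep only titles that occur more than once.
title_counts = toy["TITLE"].value_counts()
repeated_titles = title_counts[title_counts > 1]

# Rows that share a title with at least one other row.
rows_with_shared_title = toy[toy["TITLE"].duplicated(keep=False)]

print(repeated_titles)
print(rows_with_shared_title)
```

Applied to the real data, the same two expressions with `df` in place of `toy` would show which titles account for the 35 "missing" unique values.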

In [5]:
df['MEETING_NAME'].value_counts()

国务院常委会        280
政治局会议         120
中央全面深化改革会议     68
历年政府工作报告       52
中央经济工作会议       11
Name: MEETING_NAME, dtype: int64

In [8]:
# Extract the year (as a string) from the ISO 8601 publish date
df['Year'] = df['PUBLISH_DATE'].str[:4]

In [9]:
df['Year'].value_counts()

2016    69
2020    65
2019    63
2018    60
2021    59
2017    58
2022    50
2015    32
2014    21
2013    15
2023     9
2012     5
2011     3
2010     3
2009     3
2008     3
2006     3
2003     3
2007     2
2005     2
2004     2
2002     1
Name: Year, dtype: int64
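
The counts above are ordered by frequency, and `Year` is stored as a string. As an alternative, the timestamps could be parsed with `pd.to_datetime` so that year and month become integers and the counts can be sorted chronologically. The sketch below assumes every `PUBLISH_DATE` value follows the ISO 8601 pattern seen in `df.head()`. The frame `toy` is a hypothetical stand-in holding three dates copied from that output.

```python
import pandas as pd

# Three timestamps copied from the df.head() output, used as a stand-in for df.
toy = pd.DataFrame({
    "PUBLISH_DATE": [
        "2023-04-21T21:41:00Z",
        "2013-04-19T18:16:00Z",
        "2023-03-30T15:52:00Z",
    ]
})

# Parse the ISO 8601 strings into timezone-aware datetimes.
parsed = pd.to_datetime(toy["PUBLISH_DATE"], utc=True)

# Integer year and month columns derived from the parsed dates.
toy["Year"] = parsed.dt.year
toy["Month"] = parsed.dt.month

# Counts per year in chronological order instead of by frequency.
year_counts = toy["Year"].value_counts().sort_index()
print(year_counts)
```

Note that with integer years, a filter such as `df2['Year'] == '2023'` would need to compare against `2023` instead.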

In [14]:
profile = ydata_profiling.ProfileReport(df)
profile.to_file("report.html")


In [19]:
# Find duplicated rows based on 'RAW_CONTENT'
duplicated_rows = df[df.duplicated(subset='RAW_CONTENT', keep=False)]

In [20]:
duplicated_rows

Unnamed: 0,ID,RAW_CONTENT,MEETING_NAME,TITLE,PUBLISH_DATE,Year
3,m-248496144-1366366560,中共中央政治局4月19日召开会议，决定从今年下半年开始，用一年左右时间，在全党自上而下分...,政治局会议,研究部署在全党深入开展党的群众路线教育实践活动,2013-04-19T18:16:00Z,2013
185,m-259356661-1366366560,中共中央政治局4月19日召开会议，决定从今年下半年开始，用一年左右时间，在全党自上而下分...,政治局会议,研究部署在全党深入开展党的群众路线教育实践活动工,2013-04-19T18:16:00Z,2013


In [21]:
duplicated_rows['TITLE']

3       研究部署在全党深入开展党的群众路线教育实践活动
185    研究部署在全党深入开展党的群众路线教育实践活动工
Name: TITLE, dtype: object
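
The two rows have identical `RAW_CONTENT` and `PUBLISH_DATE` but different `ID` values, and their titles differ only by one trailing character. A small sketch using the standard library's `difflib` makes that difference explicit. The two strings are copied from the output above; the variable names are illustrative.

```python
from difflib import SequenceMatcher

# Titles of rows 3 and 185, copied from the output above.
title_a = "研究部署在全党深入开展党的群众路线教育实践活动"
title_b = "研究部署在全党深入开展党的群众路线教育实践活动工"

# Similarity ratio in [0, 1]; values near 1 indicate near-identical strings.
similarity = SequenceMatcher(None, title_a, title_b).ratio()

# Check whether the longer title is just the shorter one plus a suffix.
is_prefix = title_b.startswith(title_a)
extra_suffix = title_b[len(title_a):]

print(round(similarity, 3), is_prefix, extra_suffix)
```

A check like this suggests the second row is the same note ingested twice with a slightly different title, which supports dropping it.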

In [15]:
# Keep the first row of each RAW_CONTENT group; .copy() makes df2 an independent frame
df2 = df.drop_duplicates(subset='RAW_CONTENT').copy()

In [16]:
df2.shape

(530, 6)

In [24]:
# Plot the histogram of year
fig = go.Figure(data=[go.Histogram(x=df2['Year'])])
fig.update_layout(xaxis={'categoryorder': 'array', 'categoryarray': sorted(df2['Year'].unique())})
fig.show()

In [27]:
# Extract the two-digit month (as a string); assign() returns a new frame and avoids SettingWithCopyWarning
df2 = df2.assign(Month=df2['PUBLISH_DATE'].str[5:7])

In [28]:
# Plot a histogram of notes per month for 2023
df_2023 = df2[df2['Year'] == '2023']
fig = go.Figure(data=[go.Histogram(x=df_2023['Month'])])
fig.update_layout(xaxis={'categoryorder': 'array', 'categoryarray': sorted(df_2023['Month'].unique())})
fig.show()

In [18]:
# Plot the histogram of MEETING_NAME
fig = px.histogram(df2, x='MEETING_NAME')
fig.show()

In [22]:
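# Save the de-duplicated data; utf_8_sig writes a BOM, which helps spreadsheet tools such as Excel detect UTF-8
df2.to_csv('data/meeting notes clean.csv', index=False, encoding='utf_8_sig')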
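
To confirm that a file written this way reloads cleanly, a round-trip check along the following lines could be used. This is a sketch on a made-up one-row frame written to a temporary directory, not on the real output file. The `dtype` argument is included because `read_csv` would otherwise infer `Year` as an integer, which would break string comparisons such as `== '2023'`.

```python
import os
import tempfile

import pandas as pd

# One-row stand-in for df2; the values are illustrative.
toy = pd.DataFrame({"MEETING_NAME": ["政治局会议"], "Year": ["2013"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy clean.csv")
    toy.to_csv(out_path, index=False, encoding="utf_8_sig")

    # The file should begin with the three-byte UTF-8 byte order mark.
    with open(out_path, "rb") as f:
        starts_with_bom = f.read(3) == b"\xef\xbb\xbf"

    # Read back with the same encoding, keeping Year as a string.
    reloaded = pd.read_csv(out_path, encoding="utf_8_sig", dtype={"Year": str})

print(starts_with_bom)
print(reloaded)
```

Because `index=False` is used, the reloaded frame has no extra `Unnamed: 0` column of the kind visible in the original `df.head()` output.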