We're interested in posts with titles that begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question.

Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just something interesting.

We'll compare these two types of posts to determine the following:

1. Do Ask HN or Show HN posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?
3. Do Ask HN or Show HN posts receive more points on average?
4. Do posts created at a certain time receive more points on average?
5. How do these results compare with the average number of comments and points that other posts receive?

In [2]:
import pandas as pd

In [3]:
url = "hacker_news.csv"

In [4]:
df = pd.read_csv(url)

In [5]:
df.head(5)

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52
1,10975351,How to Use Open Source and Shut the Fuck Up at...,http://hueniverse.com/2016/01/26/how-to-use-op...,39,10,josep2,1/26/2016 19:30
2,11964716,Florida DJs May Face Felony for April Fools' W...,http://www.thewire.com/entertainment/2013/04/f...,2,1,vezycash,6/23/2016 22:20
3,11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Ent...,3,1,hswarna,6/17/2016 0:01
4,10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07ste...,8,2,walterbell,9/30/2015 4:12


In [40]:
df['title'] = df['title'].str.lower()

In [45]:
# Ask HN Posts
ask_hn = df[df.title.str.contains("ask hn")]

# Show HN Posts
show_hn = df[df.title.str.contains("show hn")]

# Other Posts
other = df[~df.title.str.contains("ask hn|show hn")]
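
The filters above match "ask hn" or "show hn" anywhere in the title, while the stated goal is titles that begin with those prefixes. The following is a minimal sketch of a prefix-only filter. The `sample` frame and its titles are hypothetical and are not taken from hacker_news.csv. On the real data this approach may select a slightly different set of rows than the cells above.

```python
import pandas as pd

# Hypothetical titles for illustration only; not taken from hacker_news.csv.
sample = pd.DataFrame({
    "title": [
        "Ask HN: How do you learn pandas?",
        "Show HN: My weekend project",
        "Why I stopped reading Ask HN threads",
        "Interactive Dynamic Video",
    ]
})

titles = sample["title"].str.lower()

# str.startswith matches only at the beginning of the title,
# whereas str.contains matches anywhere in it.
is_ask = titles.str.startswith("ask hn")
is_show = titles.str.startswith("show hn")

# .copy() gives independent frames, so adding columns later
# does not trigger a SettingWithCopyWarning.
ask_sample = sample[is_ask].copy()
show_sample = sample[is_show].copy()
other_sample = sample[~(is_ask | is_show)].copy()

print(len(ask_sample), len(show_sample), len(other_sample))
```

In this sample, the third title mentions "Ask HN" mid-sentence, so `str.contains` would count it as an Ask HN post while `str.startswith` does not.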

In [58]:
show_hn.shape

(1170, 8)

Determine if ask posts or show posts receive more comments on average.

In [47]:
# Average Comments
avg_ask_comments = ask_hn.num_comments.mean()
print('Average Ask HN Comments: ', avg_ask_comments)

Average Ask HN Comments:  14.031518624641834


In [48]:
avg_show_comments = show_hn.num_comments.mean()
print('Average Show HN Comments: ', avg_show_comments)

Average Show HN Comments:  10.283760683760685


Ask HN posts receive more comments on average than Show HN posts (about 14.0 versus 10.3).
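
The two averages above can also be computed in one pass by labelling each post with its type and grouping on that label. This is only a sketch: the `posts` frame, its values, and the `post_type` column are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only; not taken from hacker_news.csv.
posts = pd.DataFrame({
    "title": ["ask hn: a", "ask hn: b", "show hn: c", "show hn: d", "other e"],
    "num_comments": [10, 20, 4, 6, 30],
})

# Label each post once; np.select uses the first condition that matches
# and falls back to the default when none does.
conditions = [
    posts["title"].str.contains("ask hn"),
    posts["title"].str.contains("show hn"),
]
posts["post_type"] = np.select(conditions, ["ask", "show"], default="other")

# One groupby yields the mean for every post type at once.
avg_comments = posts.groupby("post_type")["num_comments"].mean()
print(avg_comments)
```

A single labelled frame keeps the three groups comparable, since they are guaranteed to be mutually exclusive.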

Determine if ask posts created at a certain time are more likely to attract comments. 

In [49]:
# Convert created_at to datetime
df['created_at'] = pd.to_datetime(df['created_at'])
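
Without a `format` argument, `pd.to_datetime` infers the layout of the strings. Passing the format explicitly may make the parse more predictable. The sketch below assumes every row follows the month/day/year hour:minute layout seen in the `df.head()` output, which has not been verified for the full file.

```python
import pandas as pd

# Two timestamps copied from the df.head() output above.
raw = pd.Series(["8/4/2016 11:52", "1/26/2016 19:30"])

# Assumption: all values use month/day/year hour:minute.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M")

# The .dt accessor exposes datetime components such as the hour.
print(parsed.dt.hour.tolist())
```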

In [57]:
# Count the Ask HN posts created in each hour of the day.
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# raised when a column is added to a slice of df.
ask_hn = ask_hn.assign(hour=df['created_at'].dt.hour)

ask_by_hour = ask_hn.groupby('hour').size()

ask_by_hour


hour
0      55
1      60
2      58
3      54
4      47
5      46
6      44
7      34
8      48
9      45
10     59
11     58
12     73
13     85
14    107
15    116
16    108
17    100
18    109
19    110
20     80
21    109
22     71
23     69
dtype: int64

In [59]:
# Count the Show HN posts created in each hour of the day.
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# raised when a column is added to a slice of df.
show_hn = show_hn.assign(hour=df['created_at'].dt.hour)

show_by_hour = show_hn.groupby('hour').size()

show_by_hour


hour
0      31
1      28
2      30
3      27
4      26
5      19
6      17
7      26
8      35
9      30
10     37
11     45
12     62
13    101
14     86
15     78
16     93
17     93
18     61
19     55
20     60
21     48
22     46
23     36
dtype: int64

In [71]:
# Calculate average number of comments Ask HN Posts receive by hour created
avg_by_hour = ask_hn.groupby('hour').num_comments.mean()

avg_by_hour

hour
0      8.127273
1     11.383333
2     23.810345
3      7.796296
4      7.170213
5     10.086957
6      9.022727
7      7.852941
8     10.250000
9      5.577778
10    13.440678
11    11.051724
12     9.410959
13    14.741176
14    13.233645
15    38.594828
16    16.796296
17    11.460000
18    13.201835
19    10.800000
20    21.525000
21    16.009174
22     6.746479
23     7.898551
Name: num_comments, dtype: float64
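
The average for hour 15 stands well above its neighbours. One possible explanation, not verified here, is that a few heavily commented threads pull the mean up. Showing the post count and the median next to the mean can help judge that. The sketch below uses a hypothetical `sample` frame with made-up values.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from hacker_news.csv.
sample = pd.DataFrame({
    "hour": [15, 15, 15, 9, 9],
    "num_comments": [2, 3, 295, 5, 7],
})

# Named aggregation: each keyword becomes a column in the result.
by_hour = sample.groupby("hour")["num_comments"].agg(
    posts="size", mean="mean", median="median"
)
print(by_hour)
```

In this sample, hour 15 has a mean of 100 but a median of 3, which shows how one outlier can dominate the mean.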

In [72]:
# Rank all hours by average Ask HN comments, highest first
avg_by_hour.sort_values(ascending=False)

hour
15    38.594828
2     23.810345
20    21.525000
16    16.796296
21    16.009174
13    14.741176
10    13.440678
14    13.233645
18    13.201835
17    11.460000
1     11.383333
11    11.051724
19    10.800000
8     10.250000
5     10.086957
12     9.410959
6      9.022727
0      8.127273
23     7.898551
7      7.852941
3      7.796296
4      7.170213
22     6.746479
9      5.577778
Name: num_comments, dtype: float64
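
To keep only the five best hours instead of the full ranking, `Series.nlargest` is one option. The sketch below rebuilds part of `avg_by_hour` from the values printed above so that it runs on its own.

```python
import pandas as pd

# Averages copied from the Ask HN output above (a subset of hours).
avg_by_hour = pd.Series(
    {15: 38.594828, 2: 23.810345, 20: 21.525000, 16: 16.796296,
     21: 16.009174, 13: 14.741176, 9: 5.577778},
    name="num_comments",
)
avg_by_hour.index.name = "hour"

# nlargest(5) returns the five highest values, already sorted descending.
top5 = avg_by_hour.nlargest(5)
print(top5.round(2))
```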

Determine if show or ask posts receive more points on average.

In [74]:
avg_ask_points = ask_hn[['num_points']].mean()

avg_ask_points

num_points    15.05788
dtype: float64

In [75]:
avg_show_points = show_hn[['num_points']].mean()

avg_show_points

num_points    27.416239
dtype: float64

Show HN posts receive more points on average than Ask HN posts (about 27.4 versus 15.1).

Determine if posts created at a certain time are more likely to receive more points.

In [76]:
avg_ask_points_hour = ask_hn.groupby('hour').num_points.mean()

avg_ask_points_hour

hour
0      8.200000
1     11.666667
2     13.672414
3      6.925926
4      8.276596
5     12.000000
6     13.431818
7     10.617647
8     10.729167
9      7.311111
10    18.677966
11    14.224138
12    10.712329
13    24.258824
14    11.981308
15    29.991379
16    23.351852
17    19.410000
18    15.972477
19    13.754545
20    14.387500
21    15.788991
22     7.197183
23     8.536232
Name: num_points, dtype: float64

In [77]:
avg_show_points_hour = show_hn.groupby('hour').num_points.mean()

avg_show_points_hour

hour
0     37.838710
1     25.000000
2     11.333333
3     25.148148
4     14.846154
5      5.473684
6     22.176471
7     19.000000
8     15.000000
9     18.433333
10    18.432432
11    33.155556
12    41.241935
13    24.316832
14    25.430233
15    28.564103
16    28.322581
17    27.107527
18    36.311475
19    30.945455
20    30.316667
21    18.145833
22    40.347826
23    42.388889
Name: num_points, dtype: float64

Compare to the average number of comments and points other posts receive.

In [80]:
# Count the other posts created in each hour of the day.
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# raised when a column is added to a slice of df.
other = other.assign(hour=df['created_at'].dt.hour)

# Group the other posts here, not show_hn.
other_by_hour = other.groupby('hour').size()


In [82]:
avg_other_comments_hour = other.groupby('hour').num_comments.mean()

avg_other_comments_hour

hour
0     27.076923
1     23.072000
2     27.786848
3     26.825553
4     24.125551
5     25.175258
6     21.357843
7     26.808036
8     27.026210
9     27.588015
10    26.612521
11    29.637329
12    30.347275
13    30.920393
14    32.330898
15    29.519231
16    25.394187
17    27.995723
18    26.924354
19    26.701020
20    23.139407
21    23.632302
22    23.265172
23    24.650817
Name: num_comments, dtype: float64

In [83]:
avg_other_points_hour = other.groupby('hour').num_points.mean()

avg_other_points_hour

hour
0     58.458265
1     50.606000
2     58.471655
3     56.921376
4     49.667401
5     49.966495
6     46.235294
7     56.832589
8     54.092742
9     53.936330
10    60.483926
11    57.637329
12    57.397972
13    62.585605
14    61.786013
15    60.542308
16    54.182561
17    57.978614
18    53.928967
19    60.011224
20    45.244786
21    49.420389
22    50.236148
23    52.095097
Name: num_points, dtype: float64
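
The hourly averages for Ask HN, Show HN, and other posts are printed in separate cells, which makes them hard to compare line by line. One way to view them side by side is a pivot table keyed on hour and post type. This is a sketch only: the `posts` frame, its values, and the `post_type` column are hypothetical.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from hacker_news.csv.
posts = pd.DataFrame({
    "post_type": ["ask", "ask", "show", "show", "other", "other"],
    "hour": [15, 15, 15, 12, 15, 12],
    "num_comments": [30, 50, 10, 8, 28, 32],
    "num_points": [20, 40, 30, 42, 60, 58],
})

# Rows are hours; columns are (metric, post type) pairs.
# Hours with no posts of a given type show up as NaN.
summary = posts.pivot_table(
    index="hour",
    columns="post_type",
    values=["num_comments", "num_points"],
    aggfunc="mean",
)
print(summary)
```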