# Data Cleaning 

In [1]:
import pandas as pd

# Read the users dataset.

Check which separator `users.csv` uses.

In [2]:
users = pd.read_csv('../data/users.csv', sep='#')
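
One way to find the separator without opening the file in an editor is to print its first line. The sketch below uses a small in-memory sample in place of `../data/users.csv`; the sample contents are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the first lines of users.csv.
sample = "Id#Reputation#DisplayName\n-1#1#Community\n2#101#Geoff Dalgas\n"

# Looking at the header line is usually enough to spot the separator.
first_line = sample.splitlines()[0]
print(first_line)

# With a real file the call would be pd.read_csv(path, sep="#").
sample_users = pd.read_csv(io.StringIO(sample), sep="#")
print(sample_users.columns.tolist())
```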

## Check its shape

See the number of rows and columns you're dealing with.
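
No cell in this notebook covers this step, so here is a minimal sketch on a made-up frame. On the real data the call would be `users.shape`.

```python
import pandas as pd

# Hypothetical toy frame standing in for `users`.
toy_users = pd.DataFrame({"Id": [-1, 2, 3], "Reputation": [1, 101, 101]})

# .shape is an attribute (no parentheses) holding a (rows, columns) tuple.
n_rows, n_cols = toy_users.shape
print(n_rows, n_cols)
```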

## Use `.head()` to see some rows of your dataframe.

In [3]:
users.head()

Unnamed: 0,Id,Reputation,CreationDate,DisplayName,LastAccessDate,WebsiteUrl,Location,AboutMe,Views,UpVotes,DownVotes,AccountId,Age,ProfileImageUrl
0,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,1920,-1,,
1,2,101,2010-07-19 14:01:36,Geoff Dalgas,2013-11-12 22:07:23,http://stackoverflow.com,"Corvallis, OR",<p>Developer on the StackOverflow team. Find ...,25,3,0,2,37.0,
2,3,101,2010-07-19 15:34:50,Jarrod Dixon,2014-08-08 06:42:58,http://stackoverflow.com,"New York, NY","<p><a href=""http://blog.stackoverflow.com/2009...",22,19,0,3,35.0,
3,4,101,2010-07-19 19:03:27,Emmett,2014-01-02 09:31:02,http://minesweeperonline.com,"San Francisco, CA",<p>currently at a startup in SF</p>\n\n<p>form...,11,0,0,1998,28.0,http://i.stack.imgur.com/d1oHX.jpg
4,5,6792,2010-07-19 19:03:57,Shane,2014-08-13 00:23:47,http://www.statalgo.com,"New York, NY",<p>Quantitative researcher focusing on statist...,1145,662,5,54503,35.0,


## Get the data info. 

Which columns have a large number of missing values? How much memory does this dataframe occupy?

In [4]:
users.info()
users.isna().sum().sort_values(ascending=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40503 entries, 0 to 40502
Data columns (total 14 columns):
Id                 40503 non-null int64
Reputation         40503 non-null int64
CreationDate       40503 non-null object
DisplayName        40497 non-null object
LastAccessDate     40503 non-null object
WebsiteUrl         8158 non-null object
Location           11731 non-null object
AboutMe            9424 non-null object
Views              40503 non-null int64
UpVotes            40503 non-null int64
DownVotes          40503 non-null int64
AccountId          40503 non-null int64
Age                8352 non-null float64
ProfileImageUrl    16540 non-null object
dtypes: float64(1), int64(6), object(7)
memory usage: 4.3+ MB


WebsiteUrl         32345
Age                32151
AboutMe            31079
Location           28772
ProfileImageUrl    23963
DisplayName            6
AccountId              0
DownVotes              0
UpVotes                0
Views                  0
LastAccessDate         0
CreationDate           0
Reputation             0
Id                     0
dtype: int64
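
The `+` in `memory usage: 4.3+ MB` marks a lower bound: by default pandas does not measure the strings held in `object` columns. A sketch of how to get the full figure, together with the share of missing values per column, on a made-up frame:

```python
import pandas as pd

# Hypothetical toy frame; dtype=object mirrors how the text columns are stored above.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "AboutMe": pd.Series(
        ["<p>short</p>", None, "<p>a much longer profile text</p>" * 10],
        dtype=object,
    ),
})

shallow = toy.memory_usage(deep=False).sum()  # object columns counted as pointers only
deep = toy.memory_usage(deep=True).sum()      # also counts the Python strings themselves
print(shallow, deep)

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data, `users.info(memory_usage='deep')` reports the same deep figure inside the usual `info()` summary.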

## Rename Id column to user_id.

Remember to store your results back in the dataframe.

In [5]:
users = users.rename(columns={'Id':'user_id'} )
users

Unnamed: 0,user_id,Reputation,CreationDate,DisplayName,LastAccessDate,WebsiteUrl,Location,AboutMe,Views,UpVotes,DownVotes,AccountId,Age,ProfileImageUrl
0,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,1920,-1,,
1,2,101,2010-07-19 14:01:36,Geoff Dalgas,2013-11-12 22:07:23,http://stackoverflow.com,"Corvallis, OR",<p>Developer on the StackOverflow team. Find ...,25,3,0,2,37.0,
2,3,101,2010-07-19 15:34:50,Jarrod Dixon,2014-08-08 06:42:58,http://stackoverflow.com,"New York, NY","<p><a href=""http://blog.stackoverflow.com/2009...",22,19,0,3,35.0,
3,4,101,2010-07-19 19:03:27,Emmett,2014-01-02 09:31:02,http://minesweeperonline.com,"San Francisco, CA",<p>currently at a startup in SF</p>\n\n<p>form...,11,0,0,1998,28.0,http://i.stack.imgur.com/d1oHX.jpg
4,5,6792,2010-07-19 19:03:57,Shane,2014-08-13 00:23:47,http://www.statalgo.com,"New York, NY",<p>Quantitative researcher focusing on statist...,1145,662,5,54503,35.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40498,6726,1,2011-10-09 13:16:20,AlexAtStack,2012-05-18 09:32:44,,,,0,0,0,203972,,
40499,53426,101,2014-08-05 07:54:54,John J. Camilleri,2014-08-05 08:54:37,http://johnjcamilleri.com,"Gothenburg, Sweden","<p>Accidental computational linguist, de facto...",1,2,0,34865,28.0,https://www.gravatar.com/avatar/5738c02070833b...
40500,21468,101,2013-03-02 07:50:03,Peter L.,2013-03-02 07:50:03,http://www.a1qa.com/,"Minsk, Belarus","<p>QA Manager with comprehensive, cold-blooded...",1,0,0,2211454,32.0,http://www.gravatar.com/avatar/cbd80a5b2a5257d...
40501,54132,1,2014-08-15 10:52:25,user54132,2014-08-15 10:52:25,,,,1,0,0,4894117,,


# Import the `posts.csv` dataset.

Note that this is a `gzip compressed csv`. In order to read this file correctly, you'll have to read the documentation (or help) of your `pd.read_csv()` function and check the `compression` argument. Try to understand which value of `compression=...` you should put in order to read your dataframe. 

In [6]:
posts = pd.read_csv('../data/posts.csv.gzip', compression='gzip')
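
`read_csv` defaults to `compression='infer'`, which guesses the codec from the file extension. The extensions it recognizes include `.gz`; `.gzip` does not appear to be one of them, which would explain why the argument is needed here. A sketch that round-trips a made-up frame through a temporary file with the same unusual extension:

```python
import os
import tempfile

import pandas as pd

toy_posts = pd.DataFrame({"Id": [1, 2], "Score": [23, 22]})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical path that mimics the '.gzip' extension used above.
    path = os.path.join(tmp, "posts.csv.gzip")
    toy_posts.to_csv(path, index=False, compression="gzip")

    # Passing compression explicitly avoids relying on extension-based inference.
    loaded = pd.read_csv(path, compression="gzip")

print(loaded)
```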


## Repeat the steps above to get a first look at your data (head, info, shape)

In [7]:
posts.head()

Unnamed: 0,Id,PostTypeId,AcceptedAnswerId,CreaionDate,Score,ViewCount,Body,OwnerUserId,LasActivityDate,Title,...,AnswerCount,CommentCount,FavoriteCount,LastEditorUserId,LastEditDate,CommunityOwnedDate,ParentId,ClosedDate,OwnerDisplayName,LastEditorDisplayName
0,1,1,15.0,2010-07-19 19:12:12,23,1278.0,<p>How should I elicit prior distributions fro...,8.0,2010-09-15 21:08:26,Eliciting priors from experts,...,5.0,1,14.0,,,,,,,
1,2,1,59.0,2010-07-19 19:12:57,22,8198.0,<p>In many different statistical methods there...,24.0,2012-11-12 09:21:54,What is normality?,...,7.0,1,8.0,88.0,2010-08-07 17:56:44,,,,,
2,3,1,5.0,2010-07-19 19:13:28,54,3613.0,<p>What are some valuable Statistical Analysis...,18.0,2013-05-27 14:48:36,What are some valuable Statistical Analysis op...,...,19.0,4,36.0,183.0,2011-02-12 05:50:03,2010-07-19 19:13:28,,,,
3,4,1,135.0,2010-07-19 19:13:31,13,5224.0,<p>I have two groups of data. Each with a dif...,23.0,2010-09-08 03:00:19,Assessing the significance of differences in d...,...,5.0,2,2.0,,,,,,,
4,5,2,,2010-07-19 19:14:43,81,,"<p>The R-project</p>\n\n<p><a href=""http://www...",23.0,2010-07-19 19:21:15,,...,,3,,23.0,2010-07-19 19:21:15,2010-07-19 19:14:43,3.0,,,


In [8]:
posts.info()
posts.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91976 entries, 0 to 91975
Data columns (total 21 columns):
Id                       91976 non-null int64
PostTypeId               91976 non-null int64
AcceptedAnswerId         14700 non-null float64
CreaionDate              91976 non-null object
Score                    91976 non-null int64
ViewCount                42921 non-null float64
Body                     91756 non-null object
OwnerUserId              90584 non-null float64
LasActivityDate          91976 non-null object
Title                    42921 non-null object
Tags                     42921 non-null object
AnswerCount              42921 non-null float64
CommentCount             91976 non-null int64
FavoriteCount            13246 non-null float64
LastEditorUserId         44611 non-null float64
LastEditDate             45038 non-null object
CommunityOwnedDate       2467 non-null object
ParentId                 47755 non-null float64
ClosedDate               1610 non-null obje

(91976, 21)

## Rename Id column to post_id and OwnerUserId to user_id.

Again, remember to check that your results are correctly stored inside the dataframe.

In [9]:
posts = posts.rename(columns={'Id':'post_id', 'OwnerUserId':'user_id'})
posts

Unnamed: 0,post_id,PostTypeId,AcceptedAnswerId,CreaionDate,Score,ViewCount,Body,user_id,LasActivityDate,Title,...,AnswerCount,CommentCount,FavoriteCount,LastEditorUserId,LastEditDate,CommunityOwnedDate,ParentId,ClosedDate,OwnerDisplayName,LastEditorDisplayName
0,1,1,15.0,2010-07-19 19:12:12,23,1278.0,<p>How should I elicit prior distributions fro...,8.0,2010-09-15 21:08:26,Eliciting priors from experts,...,5.0,1,14.0,,,,,,,
1,2,1,59.0,2010-07-19 19:12:57,22,8198.0,<p>In many different statistical methods there...,24.0,2012-11-12 09:21:54,What is normality?,...,7.0,1,8.0,88.0,2010-08-07 17:56:44,,,,,
2,3,1,5.0,2010-07-19 19:13:28,54,3613.0,<p>What are some valuable Statistical Analysis...,18.0,2013-05-27 14:48:36,What are some valuable Statistical Analysis op...,...,19.0,4,36.0,183.0,2011-02-12 05:50:03,2010-07-19 19:13:28,,,,
3,4,1,135.0,2010-07-19 19:13:31,13,5224.0,<p>I have two groups of data. Each with a dif...,23.0,2010-09-08 03:00:19,Assessing the significance of differences in d...,...,5.0,2,2.0,,,,,,,
4,5,2,,2010-07-19 19:14:43,81,,"<p>The R-project</p>\n\n<p><a href=""http://www...",23.0,2010-07-19 19:21:15,,...,,3,,23.0,2010-07-19 19:21:15,2010-07-19 19:14:43,3.0,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91971,115374,2,,2014-09-13 23:45:39,2,,"<p>This grew too long for a comment, but I thi...",805.0,2014-09-14 02:05:41,,...,,2,,805.0,2014-09-14 02:05:41,,115367.0,,,
91972,115375,1,,2014-09-13 23:46:05,0,9.0,<p>Assume a classification problem where there...,49365.0,2014-09-14 02:09:23,Detecting a consistent pattern in a dataset vi...,...,1.0,0,,,,,,,,
91973,115376,1,,2014-09-14 01:27:54,1,5.0,<p>My goal is to create a formula that can giv...,55746.0,2014-09-14 01:40:55,How to project video viewcount based on histor...,...,0.0,2,,7290.0,2014-09-14 01:40:55,,,,,
91974,115377,2,,2014-09-14 02:03:28,0,,<p>As a practical answer to the real questions...,805.0,2014-09-14 02:54:13,,...,,0,,805.0,2014-09-14 02:54:13,,115358.0,,,


## Define new dataframes for users and posts with the following selected columns:

**users columns**: user_id, Reputation, Views, UpVotes, DownVotes  
**posts columns**: post_id, Score, user_id, ViewCount, CommentCount, Body

In [10]:
users_new = users.loc[: ,['user_id', 'Reputation', 'Views', 'UpVotes', 'DownVotes']]
posts_new = posts.loc[: ,['Score', 'user_id', 'ViewCount', 'CommentCount', 'Body']]
users_posts = pd.concat([users_new, posts_new], axis=1)
users_posts

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,Score,user_id.1,ViewCount,CommentCount,Body
0,-1.0,1.0,0.0,5007.0,1920.0,23,8.0,1278.0,1,<p>How should I elicit prior distributions fro...
1,2.0,101.0,25.0,3.0,0.0,22,24.0,8198.0,1,<p>In many different statistical methods there...
2,3.0,101.0,22.0,19.0,0.0,54,18.0,3613.0,4,<p>What are some valuable Statistical Analysis...
3,4.0,101.0,11.0,0.0,0.0,13,23.0,5224.0,2,<p>I have two groups of data. Each with a dif...
4,5.0,6792.0,1145.0,662.0,5.0,81,23.0,,3,"<p>The R-project</p>\n\n<p><a href=""http://www..."
...,...,...,...,...,...,...,...,...,...,...
91971,,,,,,2,805.0,,2,"<p>This grew too long for a comment, but I thi..."
91972,,,,,,0,49365.0,9.0,0,<p>Assume a classification problem where there...
91973,,,,,,1,55746.0,5.0,2,<p>My goal is to create a formula that can giv...
91974,,,,,,0,805.0,,0,<p>As a practical answer to the real questions...
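
Note that the cell above goes one step further than the task asks. `pd.concat(..., axis=1)` places the two selections side by side by row position, so row 0 pairs user -1 with a post owned by user 8, and `post_id` is left out of the posts selection. A minimal sketch of the selection as specified, on made-up frames that reuse the notebook's column names:

```python
import pandas as pd

# Hypothetical toy frames with one extra column each (Location, Title).
toy_users = pd.DataFrame({
    "user_id": [-1, 2], "Reputation": [1, 101], "Views": [0, 25],
    "UpVotes": [5007, 3], "DownVotes": [1920, 0], "Location": [None, "Corvallis, OR"],
})
toy_posts = pd.DataFrame({
    "post_id": [1, 2], "Score": [23, 22], "user_id": [8.0, 24.0],
    "ViewCount": [1278.0, 8198.0], "CommentCount": [1, 1],
    "Body": ["<p>a</p>", "<p>b</p>"], "Title": ["t1", "t2"],
})

user_cols = ["user_id", "Reputation", "Views", "UpVotes", "DownVotes"]
post_cols = ["post_id", "Score", "user_id", "ViewCount", "CommentCount", "Body"]

# .copy() makes the selections independent of the original frames.
users_new = toy_users[user_cols].copy()
posts_new = toy_posts[post_cols].copy()
print(users_new.shape, posts_new.shape)
```

Keeping the two frames separate leaves the pairing of users and posts to the merge step, where it is done by key rather than by row position.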


**Note:** Check the new posts dataframe's info. What is the most noticeable change? 

Explain, in terms of efficiency, why we selected only some of the columns.

In [11]:
users_posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91976 entries, 0 to 91975
Data columns (total 10 columns):
user_id         40503 non-null float64
Reputation      40503 non-null float64
Views           40503 non-null float64
UpVotes         40503 non-null float64
DownVotes       40503 non-null float64
Score           91976 non-null int64
user_id         90584 non-null float64
ViewCount       42921 non-null float64
CommentCount    91976 non-null int64
Body            91756 non-null object
dtypes: float64(7), int64(2), object(1)
memory usage: 7.0+ MB
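
On the efficiency question: fewer columns means less memory to hold and less data to copy during a merge, and wide text columns tend to dominate the cost. A sketch that measures the difference on a made-up frame:

```python
import pandas as pd

# Hypothetical toy frame; dtype=object mirrors how the notebook's text columns are stored.
toy_users = pd.DataFrame({
    "user_id": range(1000),
    "Reputation": range(1000),
    "AboutMe": pd.Series(["<p>some long profile text</p>" * 20] * 1000, dtype=object),
    "Location": pd.Series(["New York, NY"] * 1000, dtype=object),
})

full_bytes = toy_users.memory_usage(deep=True).sum()
slim_bytes = toy_users[["user_id", "Reputation"]].memory_usage(deep=True).sum()
print(f"full: {full_bytes} bytes, selected: {slim_bytes} bytes")
```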


# Merge the new users and posts dataframes you have created into a dataframe called `posts_from_users`

You will need to perform an inner [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) of the posts and users dataframes.

Think carefully about which column(s) should be the key(s) for your merge.

In [12]:
posts_from_users = users.merge(right=posts, how = 'left')
posts_from_users.shape

(109299, 34)
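
The cell above calls `merge` without `on`, so pandas joins on every column name the two frames share (which appears to be only `user_id`, given the 34 columns in the result), and `how='left'` keeps users who never posted. The instructions ask for an inner merge, which keeps only matching rows. A sketch of the difference on made-up frames; the `validate` argument is optional:

```python
import pandas as pd

# Hypothetical toy frames.
toy_users = pd.DataFrame({"user_id": [1, 2, 3], "Reputation": [10, 200, 50]})
toy_posts = pd.DataFrame({
    "post_id": [100, 101, 102],
    "user_id": [1.0, 1.0, 2.0],   # float, as in the notebook's posts frame
    "CommentCount": [0, 4, 2],
})

# Left merge: user 3 has no posts but is kept, with NaN in the post columns.
left = toy_users.merge(toy_posts, on="user_id", how="left")

# Inner merge: only users that have at least one post survive.
inner = toy_users.merge(toy_posts, on="user_id", how="inner", validate="one_to_many")

print(left.shape, inner.shape)
```

With `validate="one_to_many"`, pandas raises a `MergeError` if the key is repeated on the left side, which is one way to catch duplicate input rows early.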

## Check the number of duplicated rows.

Remember that you can sum a boolean mask to count how many times `True` appears. This works because Python treats `True` as `1` and `False` as `0`.

In [13]:
posts_from_users.duplicated().sum()

373
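
A tiny illustration of summing a boolean mask, on made-up data:

```python
import pandas as pd

# Hypothetical toy frame: the second row repeats the first.
toy = pd.DataFrame({"user_id": [1, 1, 2, 2], "post_id": [10, 10, 20, 21]})

mask = toy.duplicated()   # True for the second and later copies of a row
print(mask.tolist())      # [False, True, False, False]
print(mask.sum())         # True counts as 1, so this is the number of repeats
```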

## Find those duplicate values and try to understand what happened.

*Hint:* Pass `keep=False` to `.duplicated()` to flag every copy of a duplicated row, not only the repeats.

*Hint 2:* You can sort the values `by=['user_id', 'post_id']` to see them in order.


In [14]:
posts_from_users.shape

(109299, 34)

In [15]:
posts_from_users.sort_values(['user_id', 'post_id']).duplicated(keep=False)

0         False
1         False
2         False
3         False
4         False
          ...  
108921    False
108922    False
108923    False
108924    False
108925    False
Length: 109299, dtype: bool
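
The cell above returns the boolean mask itself, and the truncated display shows only `False` values. Using the mask to index the frame shows the duplicated rows directly. A sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy frame with two duplicated rows.
toy = pd.DataFrame({
    "user_id": [2, 1, 1, 2, 3],
    "post_id": [20, 10, 10, 20, 30],
    "Score":   [5, 7, 7, 5, 1],
})

# keep=False marks every copy of a duplicated row, not only the repeats.
dupes = toy[toy.duplicated(keep=False)].sort_values(["user_id", "post_id"])
print(dupes)
```

Sorting after filtering puts the copies next to each other, which makes it easier to see where they come from.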

## Should you drop them? If you think it is reasonable to drop them, then do so.

Think: How would you correct it in the first place? That is, what was wrong in the first place?

*Hint:* There's a pandas method to drop duplicates. If you wanted to do it by hand, you could select the indexes of the duplicated rows and `.drop()` them.

In [16]:
# No, because there is an easier way to do it:
drops = posts_from_users.drop_duplicates()
drops

Unnamed: 0,user_id,Reputation,CreationDate,DisplayName,LastAccessDate,WebsiteUrl,Location,AboutMe,Views,UpVotes,...,AnswerCount,CommentCount,FavoriteCount,LastEditorUserId,LastEditDate,CommunityOwnedDate,ParentId,ClosedDate,OwnerDisplayName,LastEditorDisplayName
0,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,...,,0.0,,-1.0,2014-04-23 13:43:43,,,,,
1,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,...,,0.0,,-1.0,2011-03-21 17:40:28,,,,,
2,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,...,,0.0,,-1.0,2011-03-21 17:46:43,,,,,
3,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,...,,0.0,,919.0,2011-03-30 19:23:14,,,,,
4,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,...,,0.0,,919.0,2011-03-30 19:23:14,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
108921,55743,1,2014-09-13 21:03:50,AussieMeg,2014-09-13 21:18:52,,,,0,0,...,,,,,,,,,,
108922,55744,6,2014-09-13 21:39:30,Mia Maria,2014-09-13 21:39:30,,,,1,0,...,0.0,2.0,,,,,,,,
108923,55745,101,2014-09-13 23:45:27,tronbabylove,2014-09-13 23:45:27,,United States,,0,0,...,,,,,,,,,,
108924,55746,106,2014-09-14 00:29:41,GPP,2014-09-14 02:05:17,,,"<p>Stats noobie, product, marketing &amp; medi...",1,0,...,0.0,2.0,,7290.0,2014-09-14 01:40:55,,,,,
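
`drop_duplicates()` returns a new frame, which is why the result is stored in `drops`. As for what went wrong in the first place, one plausible cause (an assumption, not verified against the files) is that an input table already contains repeated rows, which the merge then multiplies. A sketch of removing them before merging, on made-up frames:

```python
import pandas as pd

# Hypothetical toy frames: user 2 appears twice in the users table.
toy_users = pd.DataFrame({"user_id": [1, 2, 2], "Reputation": [10, 200, 200]})
toy_posts = pd.DataFrame({"post_id": [100, 101], "user_id": [2, 2]})

merged_raw = toy_users.merge(toy_posts, on="user_id", how="inner")
print(merged_raw.duplicated().sum())    # repeated rows caused by the repeated user

# Cleaning the input first avoids creating the duplicates at all.
clean_users = toy_users.drop_duplicates()
merged_clean = clean_users.merge(toy_posts, on="user_id", how="inner")
print(merged_clean.duplicated().sum())
```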


## How many missing values do you have in your merged dataframe? On which columns?

In [17]:
posts_from_users.isna().sum()

user_id                       0
Reputation                    0
CreationDate                  0
DisplayName                  30
LastAccessDate                0
WebsiteUrl                72102
Location                  61539
AboutMe                   61949
Views                         0
UpVotes                       0
DownVotes                     0
AccountId                     0
Age                       77008
ProfileImageUrl           76089
post_id                   18416
PostTypeId                18416
AcceptedAnswerId          94691
CreaionDate               18416
Score                     18416
ViewCount                 66961
Body                      18636
LasActivityDate           18416
Title                     66961
Tags                      66961
AnswerCount               66961
CommentCount              18416
FavoriteCount             96266
LastEditorUserId          65035
LastEditDate              64806
CommunityOwnedDate       106902
ParentId                  62054
ClosedDa

In [18]:
posts_from_users.T.isna().sum()

0         13
1         14
2         14
3         13
4         13
          ..
109294    25
109295    20
109296    20
109297    25
109298    24
Length: 109299, dtype: int64
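
Per-row counts can also be obtained with `sum(axis=1)`, which gives the same numbers without transposing the frame. A sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy frame.
toy = pd.DataFrame({
    "user_id": [1, 2, 3],
    "Age": [37.0, None, None],
    "Location": ["NY", None, "SF"],
})

per_column = toy.isna().sum()        # missing values in each column
per_row = toy.isna().sum(axis=1)     # missing values in each row
print(per_column.to_dict())          # {'user_id': 0, 'Age': 2, 'Location': 1}
print(per_row.tolist())              # [0, 2, 1]

# The transposed form yields the same per-row counts.
per_row_via_transpose = toy.T.isna().sum()
print(per_row_via_transpose.tolist())
```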

## Select only the rows in which there is at least one missing value.

In [19]:
mask = (drops != 'NaN')
drops.filter(mask)


Unnamed: 0,user_id,Reputation,CreationDate,DisplayName,LastAccessDate,WebsiteUrl,Location,AboutMe,Views,UpVotes,...,AnswerCount,CommentCount,FavoriteCount,LastEditorUserId,LastEditDate,CommunityOwnedDate,ParentId,ClosedDate,OwnerDisplayName,LastEditorDisplayName
0,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,...,,0.0,,-1.0,2014-04-23 13:43:43,,,,,
1,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,...,,0.0,,-1.0,2011-03-21 17:40:28,,,,,
2,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,...,,0.0,,-1.0,2011-03-21 17:46:43,,,,,
3,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,...,,0.0,,919.0,2011-03-30 19:23:14,,,,,
4,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,...,,0.0,,919.0,2011-03-30 19:23:14,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
108921,55743,1,2014-09-13 21:03:50,AussieMeg,2014-09-13 21:18:52,,,,0,0,...,,,,,,,,,,
108922,55744,6,2014-09-13 21:39:30,Mia Maria,2014-09-13 21:39:30,,,,1,0,...,0.0,2.0,,,,,,,,
108923,55745,101,2014-09-13 23:45:27,tronbabylove,2014-09-13 23:45:27,,United States,,0,0,...,,,,,,,,,,
108924,55746,106,2014-09-14 00:29:41,GPP,2014-09-14 02:05:17,,,"<p>Stats noobie, product, marketing &amp; medi...",1,0,...,0.0,2.0,,7290.0,2014-09-14 01:40:55,,,,,
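
The cell above does not do what the heading asks. Missing values are `NaN` floats rather than the string `'NaN'`, so the comparison is true everywhere, and `DataFrame.filter` selects by label rather than by boolean mask. A sketch of the usual idiom, on made-up data:

```python
import pandas as pd

# Hypothetical toy frame.
toy = pd.DataFrame({
    "user_id": [1, 2, 3],
    "Age": [37.0, None, 28.0],
    "Location": ["NY", "SF", None],
})

# True for rows where at least one column is missing.
has_missing = toy.isna().any(axis=1)
rows_with_missing = toy[has_missing]
print(rows_with_missing)
```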


## You will need to do something about the missing values. Will you drop them or fill them?

Pay attention: values can be missing for different reasons. Look at the `user_id` of some of the affected rows and at the body of the message. For which values are you sure what they should be, and which ones can you only infer? Don't rush; take a good look at your data.

In [20]:
# No, because some missing values can carry a meaning that is not the same as, for example, setting the value to zero.
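
Before choosing between dropping and filling, it helps to check whether a gap is structural. In the `posts.info()` output, `ViewCount`, `Title`, `Tags` and `AnswerCount` all have exactly 42921 non-null values, which suggests these fields exist only for one kind of post. A sketch of how to test that idea by grouping on `PostTypeId`; the toy values are made up, and reading type 1 as questions and type 2 as answers is an assumption:

```python
import pandas as pd

# Hypothetical toy frame reusing the notebook's column names.
toy_posts = pd.DataFrame({
    "PostTypeId": [1, 1, 2, 2, 2],
    "ViewCount": [1278.0, 8198.0, None, None, None],
    "CommentCount": [1, 1, 3, 0, 2],
})

# Share of missing ViewCount within each post type.
missing_by_type = toy_posts["ViewCount"].isna().groupby(toy_posts["PostTypeId"]).mean()
print(missing_by_type.to_dict())   # {1: 0.0, 2: 1.0}
```

If a column is missing for an entire category, filling it with 0 or a mean would invent values; leaving it as `NaN` or handling that category separately is usually safer.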

## Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [21]:
drops.sort_values(['user_id'])

Unnamed: 0,user_id,Reputation,CreationDate,DisplayName,LastAccessDate,WebsiteUrl,Location,AboutMe,Views,UpVotes,...,AnswerCount,CommentCount,FavoriteCount,LastEditorUserId,LastEditDate,CommunityOwnedDate,ParentId,ClosedDate,OwnerDisplayName,LastEditorDisplayName
0,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,...,,0.0,,-1.0,2014-04-23 13:43:43,,,,,
135,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,...,,0.0,,-1.0,2013-04-14 19:18:49,,,,,
136,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,...,,0.0,,88.0,2013-06-06 19:34:08,,,,,
137,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,...,,0.0,,-1.0,2013-06-19 09:27:23,,,,,
138,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,...,,0.0,,-1.0,2013-07-15 16:50:56,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
108921,55743,1,2014-09-13 21:03:50,AussieMeg,2014-09-13 21:18:52,,,,0,0,...,,,,,,,,,,
108922,55744,6,2014-09-13 21:39:30,Mia Maria,2014-09-13 21:39:30,,,,1,0,...,0.0,2.0,,,,,,,,
108923,55745,101,2014-09-13 23:45:27,tronbabylove,2014-09-13 23:45:27,,United States,,0,0,...,,,,,,,,,,
108924,55746,106,2014-09-14 00:29:41,GPP,2014-09-14 02:05:17,,,"<p>Stats noobie, product, marketing &amp; medi...",1,0,...,0.0,2.0,,7290.0,2014-09-14 01:40:55,,,,,
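
The cell above only sorts the frame. Likely candidates for conversion, judging from the `info()` outputs, are the date columns stored as `object` and the id columns that became `float64` because they contain missing values. A sketch on made-up data; the nullable `Int64` dtype is a suggestion, not something the notebook prescribes:

```python
import pandas as pd

# Hypothetical toy frame reusing the notebook's column names.
toy = pd.DataFrame({
    "user_id": [1, 2, 3],
    "CreationDate": ["2010-07-19 06:55:26", "2010-07-19 14:01:36", "2014-09-13 21:03:50"],
    "post_id": [100.0, None, 102.0],
})

toy = toy.assign(
    # Parse timestamp strings into datetime64 values.
    CreationDate=pd.to_datetime(toy["CreationDate"]),
    # Nullable integer dtype keeps ids integral while still allowing missing values.
    post_id=toy["post_id"].astype("Int64"),
)
print(toy.dtypes)
```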


# Bonus 1: (filtering) What is the average number of comments for users who are above the average reputation?

*Hint:* Calculate the average user Reputation and store it in a variable called `avg_reputation`. Then use that variable to filter the dataset and generate the result for each case (`Reputation > avg_reputation` and `Reputation <= avg_reputation`).

*Hint 2:* You could create a variable based on that condition and use `groupby` to perform the task above.

In [30]:
# Store the average itself, as the hint asks, then filter on it.
avg_reputation = drops['Reputation'].mean()
drops[drops['Reputation'] > avg_reputation]

Unnamed: 0,user_id,Reputation,CreationDate,DisplayName,LastAccessDate,WebsiteUrl,Location,AboutMe,Views,UpVotes,...,AnswerCount,CommentCount,FavoriteCount,LastEditorUserId,LastEditDate,CommunityOwnedDate,ParentId,ClosedDate,OwnerDisplayName,LastEditorDisplayName
214,5,6792,2010-07-19 19:03:57,Shane,2014-08-13 00:23:47,http://www.statalgo.com,"New York, NY",<p>Quantitative researcher focusing on statist...,1145,662,...,15.0,5.0,137.0,22047.0,2013-06-07 06:38:10,2010-08-09 13:05:50,,,,
215,5,6792,2010-07-19 19:03:57,Shane,2014-08-13 00:23:47,http://www.statalgo.com,"New York, NY",<p>Quantitative researcher focusing on statist...,1145,662,...,,1.0,,,,2011-08-12 20:29:33,7.0,,,
216,5,6792,2010-07-19 19:03:57,Shane,2014-08-13 00:23:47,http://www.statalgo.com,"New York, NY",<p>Quantitative researcher focusing on statist...,1145,662,...,,0.0,,,,,25.0,,,
217,5,6792,2010-07-19 19:03:57,Shane,2014-08-13 00:23:47,http://www.statalgo.com,"New York, NY",<p>Quantitative researcher focusing on statist...,1145,662,...,,0.0,,,,,33.0,,,
218,5,6792,2010-07-19 19:03:57,Shane,2014-08-13 00:23:47,http://www.statalgo.com,"New York, NY",<p>Quantitative researcher focusing on statist...,1145,662,...,,0.0,,5.0,2010-07-19 20:12:16,,51.0,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
84191,32036,7227,2013-10-29 12:42:40,Nick Stauner,2014-09-13 16:32:54,http://www.linkedin.com/in/nickstauner,"Cleveland, OH",<h2><em>Personality &amp; social psychologist<...,991,2664,...,,2.0,,,,,111821.0,,,
84192,32036,7227,2013-10-29 12:42:40,Nick Stauner,2014-09-13 16:32:54,http://www.linkedin.com/in/nickstauner,"Cleveland, OH",<h2><em>Personality &amp; social psychologist<...,991,2664,...,,0.0,,32036.0,2014-08-25 17:42:04,,113072.0,,,
84193,32036,7227,2013-10-29 12:42:40,Nick Stauner,2014-09-13 16:32:54,http://www.linkedin.com/in/nickstauner,"Cleveland, OH",<h2><em>Personality &amp; social psychologist<...,991,2664,...,,0.0,,,,,113377.0,,,
84194,32036,7227,2013-10-29 12:42:40,Nick Stauner,2014-09-13 16:32:54,http://www.linkedin.com/in/nickstauner,"Cleveland, OH",<h2><em>Personality &amp; social psychologist<...,991,2664,...,,7.0,,,,,113525.0,,,
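
The cell above displays the rows for above-average users but stops short of the average number of comments. A sketch of the full calculation on made-up frames. It takes the mean reputation from the users table, since the merged frame repeats each user once per post; whether that is the intended average is an assumption:

```python
import pandas as pd

# Hypothetical toy frames: one row per user, and one row per (user, post).
toy_users = pd.DataFrame({"user_id": [1, 2, 3], "Reputation": [10, 20, 300]})
toy_merged = pd.DataFrame({
    "user_id":      [1, 2, 2, 3, 3],
    "Reputation":   [10, 20, 20, 300, 300],
    "CommentCount": [1.0, 0.0, 2.0, 4.0, 6.0],
})

avg_reputation = toy_users["Reputation"].mean()          # 110.0
above_avg = toy_merged["Reputation"] > avg_reputation

# Filtering: mean comments for posts by above-average users.
mean_above = toy_merged.loc[above_avg, "CommentCount"].mean()
print(mean_above)                                        # 5.0

# Grouping: both cases at once.
by_group = toy_merged.groupby(above_avg)["CommentCount"].mean()
print(by_group.to_dict())                                # {False: 1.0, True: 5.0}
```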


# Bonus 2: (grouping) Group your dataframe by the Reputation of your user. Calculate the mean value of ViewCount and CommentCount for each reputation value.

Suppose the missing values in ViewCount are due to a systemic error: the system aborted before recording them, and you want to guess what values should have been there in the first place.

Would the per-reputation mean be a good candidate for imputing the missing `ViewCount` values? If so, impute them with these values.

In [34]:
drops['ViewCount'].mean()

556.6561581492367

In [35]:
drops['CommentCount'].mean()

1.894650269363243

In [37]:
# Yes, because if a person has no views, it should not count toward their reputation.
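
The cells above compute overall means rather than per-reputation means. A sketch of the grouped calculation, and of filling missing `ViewCount` values with the mean of the same reputation group, on made-up data. Whether reputation is a sensible predictor of view counts is left open; the code only shows the mechanics:

```python
import pandas as pd

# Hypothetical toy frame reusing the notebook's column names.
toy = pd.DataFrame({
    "Reputation":   [1, 1, 1, 101, 101],
    "ViewCount":    [10.0, None, 30.0, 500.0, None],
    "CommentCount": [0, 2, 1, 4, 6],
})

# Mean ViewCount and CommentCount for each reputation value (NaN is skipped).
group_means = toy.groupby("Reputation")[["ViewCount", "CommentCount"]].mean()
print(group_means)

# transform('mean') returns one value per row, aligned with the original index.
fill_values = toy.groupby("Reputation")["ViewCount"].transform("mean")
toy["ViewCount_filled"] = toy["ViewCount"].fillna(fill_values)
print(toy["ViewCount_filled"].tolist())   # [10.0, 20.0, 30.0, 500.0, 500.0]
```

Writing the result to a new column keeps the original `ViewCount` available for comparison.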

## References

Sample database used: https://relational.fit.cvut.cz/dataset/Stats

Stack-overflow database: https://www.brentozar.com/archive/2015/10/how-to-download-the-stack-overflow-database-via-bittorrent/
