# Data Cleaning 

In [1]:
import pandas as pd
import numpy as np

# Read the users dataset.

Check which separator `users.csv` uses before reading it.

In [2]:
users = pd.read_csv('data/users.csv',sep='#')
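
One way to find the separator without opening the file by hand is to look at the header line and count candidate characters. The snippet below is a minimal sketch on a hypothetical in-memory sample (the string `raw` is made up for illustration and is not the real `users.csv`); the list of candidate separators is also an assumption.

```python
import io
import pandas as pd

# Hypothetical miniature of a '#'-separated file (not the real users.csv).
raw = "Id#Reputation#DisplayName\n-1#1#Community\n2#101#Geoff Dalgas\n"

# Peek at the header line and count how often each candidate separator appears.
header = raw.splitlines()[0]
counts = {sep: header.count(sep) for sep in [",", ";", "\t", "#", "|"]}
best_sep = max(counts, key=counts.get)

# Read the sample with the separator that appears most often in the header.
sample = pd.read_csv(io.StringIO(raw), sep=best_sep)
print(best_sep, sample.shape)
```

For a real file, the same idea applies by reading only the first line with `open(path).readline()`.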

## Check its shape

See the number of rows and columns you're dealing with.

## Use `.head()` to see some rows of your dataframe.

In [3]:
users.shape

(40503, 14)

In [4]:
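users.info  # note: without parentheses this only displays the bound method; users.info() prints dtypes, non-null counts and memory usage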

<bound method DataFrame.info of           Id  Reputation         CreationDate        DisplayName  \
0         -1           1  2010-07-19 06:55:26          Community   
1          2         101  2010-07-19 14:01:36       Geoff Dalgas   
2          3         101  2010-07-19 15:34:50       Jarrod Dixon   
3          4         101  2010-07-19 19:03:27             Emmett   
4          5        6792  2010-07-19 19:03:57              Shane   
...      ...         ...                  ...                ...   
40498   6726           1  2011-10-09 13:16:20        AlexAtStack   
40499  53426         101  2014-08-05 07:54:54  John J. Camilleri   
40500  21468         101  2013-03-02 07:50:03           Peter L.   
40501  54132           1  2014-08-15 10:52:25          user54132   
40502  39943           1  2014-02-10 20:55:19        user1133128   

            LastAccessDate                      WebsiteUrl  \
0      2010-07-19 06:55:26  http://meta.stackexchange.com/   
1      2013-11-12 22:07:23 

In [5]:
users.head()

Unnamed: 0,Id,Reputation,CreationDate,DisplayName,LastAccessDate,WebsiteUrl,Location,AboutMe,Views,UpVotes,DownVotes,AccountId,Age,ProfileImageUrl
0,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,1920,-1,,
1,2,101,2010-07-19 14:01:36,Geoff Dalgas,2013-11-12 22:07:23,http://stackoverflow.com,"Corvallis, OR",<p>Developer on the StackOverflow team. Find ...,25,3,0,2,37.0,
2,3,101,2010-07-19 15:34:50,Jarrod Dixon,2014-08-08 06:42:58,http://stackoverflow.com,"New York, NY","<p><a href=""http://blog.stackoverflow.com/2009...",22,19,0,3,35.0,
3,4,101,2010-07-19 19:03:27,Emmett,2014-01-02 09:31:02,http://minesweeperonline.com,"San Francisco, CA",<p>currently at a startup in SF</p>\n\n<p>form...,11,0,0,1998,28.0,http://i.stack.imgur.com/d1oHX.jpg
4,5,6792,2010-07-19 19:03:57,Shane,2014-08-13 00:23:47,http://www.statalgo.com,"New York, NY",<p>Quantitative researcher focusing on statist...,1145,662,5,54503,35.0,


## Get the data info. 

Which columns have a large number of missing values? How much memory does this dataframe occupy?
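
Both questions can be answered with `DataFrame.info()` (called with parentheses) and `DataFrame.memory_usage()`. The following is a small sketch on a made-up frame named `toy`; its columns and values are illustrative assumptions, not the real users data.

```python
import io
import pandas as pd

# Toy frame standing in for the users data (values are invented).
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "Age": [37.0, None, None],
    "Location": ["NY", None, "SF"],
})

# info() prints dtypes, non-null counts and memory usage; here it is captured in a buffer.
buf = io.StringIO()
toy.info(buf=buf, memory_usage="deep")
info_text = buf.getvalue()

# Total memory in bytes, counting the actual size of string objects.
total_bytes = int(toy.memory_usage(deep=True).sum())

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(info_text)
print(total_bytes)
print(missing_share)
```

Looking at the share of missing values instead of the raw count makes columns easier to compare across dataframes of different sizes.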

In [6]:
isna=pd.DataFrame( users.isna().sum(), columns = ['missing_vl'])

In [7]:
isna.info  # note: without parentheses this only displays the bound method; isna.info() prints the summary

<bound method DataFrame.info of                  missing_vl
Id                        0
Reputation                0
CreationDate              0
DisplayName               6
LastAccessDate            0
WebsiteUrl            32345
Location              28772
AboutMe               31079
Views                     0
UpVotes                   0
DownVotes                 0
AccountId                 0
Age                   32151
ProfileImageUrl       23963>

In [8]:
isna.sort_values(by='missing_vl', ascending = False).head(7)

Unnamed: 0,missing_vl
WebsiteUrl,32345
Age,32151
AboutMe,31079
Location,28772
ProfileImageUrl,23963
DisplayName,6
Id,0


In [9]:
isna.mean()

missing_vl    10594.0
dtype: float64

In [10]:
# The columns with a large number of missing values are:
# WebsiteUrl
# Age
# AboutMe
# Location
# ProfileImageUrl

## Rename Id column to user_id.

Remember to store your results back in the dataframe.

In [11]:
users.rename({'Id': 'user_id'}, axis =1).head(3)

Unnamed: 0,user_id,Reputation,CreationDate,DisplayName,LastAccessDate,WebsiteUrl,Location,AboutMe,Views,UpVotes,DownVotes,AccountId,Age,ProfileImageUrl
0,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,1920,-1,,
1,2,101,2010-07-19 14:01:36,Geoff Dalgas,2013-11-12 22:07:23,http://stackoverflow.com,"Corvallis, OR",<p>Developer on the StackOverflow team. Find ...,25,3,0,2,37.0,
2,3,101,2010-07-19 15:34:50,Jarrod Dixon,2014-08-08 06:42:58,http://stackoverflow.com,"New York, NY","<p><a href=""http://blog.stackoverflow.com/2009...",22,19,0,3,35.0,


In [12]:
users = users.rename({'Id': 'user_id'}, axis =1)
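
The reassignment matters because `rename` returns a new dataframe and leaves the original untouched. A minimal sketch on a toy frame (the names `toy` and `renamed` are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"Id": [1, 2], "Reputation": [101, 1]})

# rename() returns a new object; `toy` keeps its old column names.
renamed = toy.rename(columns={"Id": "user_id"})

print(list(toy.columns))
print(list(renamed.columns))
```

Passing `columns={...}` is equivalent to the `axis=1` form used in the cell above.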

# Import the `posts.csv` dataset.

Note that this is a `gzip compressed csv`. In order to read this file correctly, you'll have to read the documentation (or help) of your `pd.read_csv()` function and check the `compression` argument. Try to understand which value of `compression=...` you should put in order to read your dataframe. 

In [13]:
posts = pd.read_csv('data/posts.csv.gzip',compression='gzip')
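
As far as I know, the default `compression='infer'` recognizes the `.gz` extension but not `.gzip`, which is why the argument is passed explicitly here. The round trip below is a sketch using a temporary file and a made-up frame; the file name `toy_posts.csv.gzip` is hypothetical.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({"Id": [1, 2], "Score": [23, 22]})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name that mimics the '.gzip' extension of the dataset.
    path = os.path.join(tmp, "toy_posts.csv.gzip")
    toy.to_csv(path, index=False, compression="gzip")

    # State the compression explicitly instead of relying on the file extension.
    restored = pd.read_csv(path, compression="gzip")

print(restored)
```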

## Perform the same as above to understand a bit of your data (head, info, shape)

In [14]:
posts.shape

(91976, 21)

In [15]:
posts.info  # note: without parentheses this only displays the bound method; posts.info() prints the summary

<bound method DataFrame.info of            Id  PostTypeId  AcceptedAnswerId          CreaionDate  Score  \
0           1           1              15.0  2010-07-19 19:12:12     23   
1           2           1              59.0  2010-07-19 19:12:57     22   
2           3           1               5.0  2010-07-19 19:13:28     54   
3           4           1             135.0  2010-07-19 19:13:31     13   
4           5           2               NaN  2010-07-19 19:14:43     81   
...       ...         ...               ...                  ...    ...   
91971  115374           2               NaN  2014-09-13 23:45:39      2   
91972  115375           1               NaN  2014-09-13 23:46:05      0   
91973  115376           1               NaN  2014-09-14 01:27:54      1   
91974  115377           2               NaN  2014-09-14 02:03:28      0   
91975  115378           2               NaN  2014-09-14 02:09:23      0   

       ViewCount                                               Body

In [16]:
posts.head()

Unnamed: 0,Id,PostTypeId,AcceptedAnswerId,CreaionDate,Score,ViewCount,Body,OwnerUserId,LasActivityDate,Title,...,AnswerCount,CommentCount,FavoriteCount,LastEditorUserId,LastEditDate,CommunityOwnedDate,ParentId,ClosedDate,OwnerDisplayName,LastEditorDisplayName
0,1,1,15.0,2010-07-19 19:12:12,23,1278.0,<p>How should I elicit prior distributions fro...,8.0,2010-09-15 21:08:26,Eliciting priors from experts,...,5.0,1,14.0,,,,,,,
1,2,1,59.0,2010-07-19 19:12:57,22,8198.0,<p>In many different statistical methods there...,24.0,2012-11-12 09:21:54,What is normality?,...,7.0,1,8.0,88.0,2010-08-07 17:56:44,,,,,
2,3,1,5.0,2010-07-19 19:13:28,54,3613.0,<p>What are some valuable Statistical Analysis...,18.0,2013-05-27 14:48:36,What are some valuable Statistical Analysis op...,...,19.0,4,36.0,183.0,2011-02-12 05:50:03,2010-07-19 19:13:28,,,,
3,4,1,135.0,2010-07-19 19:13:31,13,5224.0,<p>I have two groups of data. Each with a dif...,23.0,2010-09-08 03:00:19,Assessing the significance of differences in d...,...,5.0,2,2.0,,,,,,,
4,5,2,,2010-07-19 19:14:43,81,,"<p>The R-project</p>\n\n<p><a href=""http://www...",23.0,2010-07-19 19:21:15,,...,,3,,23.0,2010-07-19 19:21:15,2010-07-19 19:14:43,3.0,,,


## Rename Id column to post_id and OwnerUserId to user_id.

Again, remember to check that your results are correctly stored inside the dataframe.

In [17]:
posts = posts.rename({'Id': 'post_id', 'OwnerUserId': 'user_id'},axis=1)
posts.head(10)

Unnamed: 0,post_id,PostTypeId,AcceptedAnswerId,CreaionDate,Score,ViewCount,Body,user_id,LasActivityDate,Title,...,AnswerCount,CommentCount,FavoriteCount,LastEditorUserId,LastEditDate,CommunityOwnedDate,ParentId,ClosedDate,OwnerDisplayName,LastEditorDisplayName
0,1,1,15.0,2010-07-19 19:12:12,23,1278.0,<p>How should I elicit prior distributions fro...,8.0,2010-09-15 21:08:26,Eliciting priors from experts,...,5.0,1,14.0,,,,,,,
1,2,1,59.0,2010-07-19 19:12:57,22,8198.0,<p>In many different statistical methods there...,24.0,2012-11-12 09:21:54,What is normality?,...,7.0,1,8.0,88.0,2010-08-07 17:56:44,,,,,
2,3,1,5.0,2010-07-19 19:13:28,54,3613.0,<p>What are some valuable Statistical Analysis...,18.0,2013-05-27 14:48:36,What are some valuable Statistical Analysis op...,...,19.0,4,36.0,183.0,2011-02-12 05:50:03,2010-07-19 19:13:28,,,,
3,4,1,135.0,2010-07-19 19:13:31,13,5224.0,<p>I have two groups of data. Each with a dif...,23.0,2010-09-08 03:00:19,Assessing the significance of differences in d...,...,5.0,2,2.0,,,,,,,
4,5,2,,2010-07-19 19:14:43,81,,"<p>The R-project</p>\n\n<p><a href=""http://www...",23.0,2010-07-19 19:21:15,,...,,3,,23.0,2010-07-19 19:21:15,2010-07-19 19:14:43,3.0,,,
5,6,1,,2010-07-19 19:14:44,152,29229.0,"<p>Last year, I read a blog post from <a href=...",5.0,2014-05-29 03:54:31,The Two Cultures: statistics vs. machine learn...,...,15.0,5,137.0,22047.0,2013-06-07 06:38:10,2010-08-09 13:05:50,,,,
6,7,1,18.0,2010-07-19 19:15:59,76,5808.0,<p>I've been working on a new method for analy...,38.0,2013-12-28 06:53:10,Locating freely available data samples,...,24.0,3,79.0,253.0,2013-09-26 21:50:36,2010-07-20 20:50:48,,,,
7,8,1,,2010-07-19 19:16:21,0,288.0,"<p>Sorry, but the emptyness was a bit overwhel...",37.0,2010-10-18 07:57:31,So how many staticians *does* it take to screw...,...,1.0,2,,449.0,2010-10-18 07:57:31,,,2010-07-19 20:19:46,,
8,9,2,,2010-07-19 19:16:27,13,,"<p><a href=""http://incanter.org/"">Incanter</a>...",50.0,2010-07-19 19:16:27,,...,,3,,,,2010-07-19 19:16:27,3.0,,,
9,10,1,1887.0,2010-07-19 19:17:47,23,21925.0,<p>Many studies in the social sciences use Lik...,24.0,2012-10-23 17:33:41,Under what conditions should Likert scales be ...,...,4.0,4,12.0,919.0,2011-03-30 15:31:46,,,,,


## Define new dataframes for users and posts with the following selected columns:

**users columns**: user_id, Reputation, Views, UpVotes, DownVotes  
**posts columns**: post_id, Score, user_id, ViewCount, CommentCount, Body

In [18]:
users_columns = users[['user_id', 'Reputation', 'Views', 'UpVotes', 'DownVotes']]
users_columns.head(2)

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes
0,-1,1,0,5007,1920
1,2,101,25,3,0


In [19]:
posts_columns = posts[['post_id', 'Score', 'user_id', 'ViewCount', 'CommentCount', 'Body']]
posts_columns.head(2)

Unnamed: 0,post_id,Score,user_id,ViewCount,CommentCount,Body
0,1,23,8.0,1278.0,1,<p>How should I elicit prior distributions fro...
1,2,22,24.0,8198.0,1,<p>In many different statistical methods there...


In [20]:
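users_columns.info  # note: without parentheses this only displays the bound method; users_columns.info() prints the summary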

<bound method DataFrame.info of        user_id  Reputation  Views  UpVotes  DownVotes
0           -1           1      0     5007       1920
1            2         101     25        3          0
2            3         101     22       19          0
3            4         101     11        0          0
4            5        6792   1145      662          5
...        ...         ...    ...      ...        ...
40498     6726           1      0        0          0
40499    53426         101      1        2          0
40500    21468         101      1        0          0
40501    54132           1      1        0          0
40502    39943           1      0        0          0

[40503 rows x 5 columns]>

In [21]:
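posts_columns.info  # note: without parentheses this only displays the bound method; posts_columns.info() prints the summary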

<bound method DataFrame.info of        post_id  Score  user_id  ViewCount  CommentCount  \
0            1     23      8.0     1278.0             1   
1            2     22     24.0     8198.0             1   
2            3     54     18.0     3613.0             4   
3            4     13     23.0     5224.0             2   
4            5     81     23.0        NaN             3   
...        ...    ...      ...        ...           ...   
91971   115374      2    805.0        NaN             2   
91972   115375      0  49365.0        9.0             0   
91973   115376      1  55746.0        5.0             2   
91974   115377      0    805.0        NaN             0   
91975   115378      0   7250.0        NaN             0   

                                                    Body  
0      <p>How should I elicit prior distributions fro...  
1      <p>In many different statistical methods there...  
2      <p>What are some valuable Statistical Analysis...  
3      <p>I have two gr

**Note:** Check the new posts dataframe's info. What is the most noticeable change? 

Explain why we have chosen only some columns of it in terms of efficiency.

In [22]:
# It's lighter in memory and keeps only the relevant information
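
The efficiency gain can be measured rather than assumed. This sketch compares the memory footprint of a toy frame before and after selecting a subset of columns; the frame and its text columns are invented for illustration.

```python
import pandas as pd

# Toy frame with two numeric columns and two text columns (invented values).
full = pd.DataFrame({
    "post_id": [1, 2, 3],
    "Score": [23, 22, 54],
    "Title": ["first toy title", "second toy title", "third toy title"],
    "Tags": ["<tag-a>", "<tag-b>", "<tag-c>"],
})

# Keep only the columns needed for the analysis; .copy() makes the result independent.
slim = full[["post_id", "Score"]].copy()

full_bytes = int(full.memory_usage(deep=True).sum())
slim_bytes = int(slim.memory_usage(deep=True).sum())
print(full_bytes, slim_bytes)
```

Text columns usually dominate the footprint, so dropping unused ones tends to help the most.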

# Merge the new dataframes you have created, of users and posts. Create a dataframe called `posts_from_users`

You will need to make an inner [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) of posts and users dataframes. 

Think carefully about which key(s) you should merge on.

In [23]:
posts_from_users = pd.merge(left=users_columns,
                            right=posts_columns,
                            on='user_id')
posts_from_users

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
0,-1,1,0,5007,1920,2175,0,,0,<p><strong>CrossValidated</strong> is for stat...
1,-1,1,0,5007,1920,8576,0,,0,
2,-1,1,0,5007,1920,8578,0,,0,
3,-1,1,0,5007,1920,8981,0,,0,"<p>""Statistics"" can refer variously to the (wi..."
4,-1,1,0,5007,1920,8982,0,,0,This generic tag is only rarely suitable; use ...
...,...,...,...,...,...,...,...,...,...,...
90878,55734,1,0,0,0,115352,0,16.0,0,"<p>For example, I was looking at <a href=""http..."
90879,55738,11,0,0,0,115360,2,40.0,4,<p>Is Student's t test a Wald test?</p>\n\n<p>...
90880,55742,6,0,0,0,115366,1,17.0,0,<p>Does any standard statistical software like...
90881,55744,6,1,0,0,115370,1,13.0,2,<p>im analyzing an article for my studies with...
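
To make the behaviour of an inner merge on `user_id` concrete, here is a sketch on two toy tables (all values invented). The `validate` argument is an optional safeguard that is not used in the cell above: it raises an error if the left keys are not unique.

```python
import pandas as pd

users_toy = pd.DataFrame({"user_id": [1, 2, 3], "Reputation": [10, 20, 30]})
posts_toy = pd.DataFrame({
    "post_id": [100, 101, 102, 103],
    "user_id": [1, 1, 2, 4],
    "Score": [5, 3, 0, 1],
})

# Inner merge keeps only user_id values present in both tables:
# user 3 (no posts) and post 103 (unknown user 4) are dropped.
merged = pd.merge(
    left=users_toy,
    right=posts_toy,
    on="user_id",
    how="inner",
    validate="one_to_many",  # assumption: each user should appear once on the left
)
print(merged)
```

If the left table contained a repeated `user_id`, `validate="one_to_many"` would raise a `MergeError` instead of silently multiplying rows.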


## Check the number of duplicated rows.

Remember that you can sum a boolean mask to count how many times `True` appears in it. This works because `True` is interpreted as `1` in Python whereas `False` is interpreted as `0`.

In [24]:
# 598 rows are involved in duplication (keep=False flags every copy, including the first)

posts_from_users.duplicated(keep=False).sum()

598
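
The count depends on the `keep` argument. A small sketch on invented data shows the difference between flagging every copy and flagging only the extra copies:

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "post_id": [10, 10, 20, 30, 30, 30],
})

# keep=False marks every row that has at least one identical twin.
all_copies = int(toy.duplicated(keep=False).sum())

# The default keep='first' marks only the repeats after the first occurrence,
# which is the number of rows drop_duplicates() would remove.
extra_copies = int(toy.duplicated().sum())

print(all_copies, extra_copies)
```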

## Find those duplicate values and try to understand what happened.

*Hint:* You can use the argument `keep=False` of the `.duplicated()` method to flag every copy of a duplicated row.

*Hint 2:* You can sort the values `by=['user_id', 'post_id']` to see them in order.


In [25]:
mask = posts_from_users.duplicated(keep=False)

posts_from_users.loc[mask,:].sort_values(by=['user_id', 'post_id'])

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
8396,760,168,13,13,0,1289,7,1139.0,8,<p>I am having difficulties to select the righ...
8399,760,168,13,13,0,1289,7,1139.0,8,<p>I am having difficulties to select the righ...
8397,760,168,13,13,0,8625,6,1799.0,3,<p>I was fiddling with PCA and LDA methods and...
8400,760,168,13,13,0,8625,6,1799.0,3,<p>I was fiddling with PCA and LDA methods and...
8398,760,168,13,13,0,23987,0,62.0,3,<p>I was studying on a PAMI article and I have...
...,...,...,...,...,...,...,...,...,...,...
90343,54711,4,18,0,0,114527,0,45.0,5,<p>From Shapiro-Wilk's test I see that the res...
90368,54741,16,1,0,0,113334,3,122.0,9,<p>I am confused on what I have read about the...
90369,54741,16,1,0,0,113334,3,122.0,9,<p>I am confused on what I have read about the...
90460,54911,1,1,0,0,113691,0,36.0,11,<p>I extract data related to a movie by sentim...


## Should you drop them? If you think it is reasonable to drop them, then drop them.

Think: How would you correct it in the first place? That is, what was wrong in the first place?

*Hint:* There's a pandas method to drop duplicates. If you wanted to do it by hand, you could select the indexes of the duplicated values and `.drop()` them.

In [26]:
# number of repeated comments in the posts_columns DataFrame:

posts_columns.duplicated(subset=['Body']).sum()

233

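-> Yes, the duplicated rows should be dropped, because they store exactly the same information.
What was wrong in the first place is that the data was already duplicated before the merge (the check above looks at repeated `Body` values in the posts DataFrame), so the duplicates carried over into the merged result.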
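
To find out which input table carries the duplication, one option is to count fully duplicated rows in each table before merging. The sketch below uses invented toy tables in which the users side holds the repeated row; which side is affected in the real data is not established here.

```python
import pandas as pd

# Toy inputs: user 2 appears twice in the users table (invented data).
users_toy = pd.DataFrame({"user_id": [1, 2, 2], "Reputation": [10, 20, 20]})
posts_toy = pd.DataFrame({"post_id": [100, 101], "user_id": [1, 2]})

# Count fully duplicated rows on each side before merging.
dup_users = int(users_toy.duplicated().sum())
dup_posts = int(posts_toy.duplicated().sum())

# Merging the raw inputs multiplies the rows of the duplicated key ...
merged_raw = users_toy.merge(posts_toy, on="user_id")

# ... while deduplicating the inputs first avoids the problem at its source.
merged_clean = users_toy.drop_duplicates().merge(posts_toy, on="user_id")

print(dup_users, dup_posts, len(merged_raw), len(merged_clean))
```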

In [27]:
# dropping duplicates:

posts_from_users_dropado = posts_from_users.drop_duplicates()

In [28]:
posts_from_users_dropado

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
0,-1,1,0,5007,1920,2175,0,,0,<p><strong>CrossValidated</strong> is for stat...
1,-1,1,0,5007,1920,8576,0,,0,
2,-1,1,0,5007,1920,8578,0,,0,
3,-1,1,0,5007,1920,8981,0,,0,"<p>""Statistics"" can refer variously to the (wi..."
4,-1,1,0,5007,1920,8982,0,,0,This generic tag is only rarely suitable; use ...
...,...,...,...,...,...,...,...,...,...,...
90878,55734,1,0,0,0,115352,0,16.0,0,"<p>For example, I was looking at <a href=""http..."
90879,55738,11,0,0,0,115360,2,40.0,4,<p>Is Student's t test a Wald test?</p>\n\n<p>...
90880,55742,6,0,0,0,115366,1,17.0,0,<p>Does any standard statistical software like...
90881,55744,6,1,0,0,115370,1,13.0,2,<p>im analyzing an article for my studies with...


## How many missing values do you have in your merged dataframe? On which columns?

-> There are missing values in two columns: ViewCount (48,396) and Body (220).

In [29]:
posts_from_users_dropado.isna().sum()

user_id             0
Reputation          0
Views               0
UpVotes             0
DownVotes           0
post_id             0
Score               0
ViewCount       48396
CommentCount        0
Body              220
dtype: int64

## Select only the rows in which there is at least one missing value.

In [30]:
mask = posts_from_users_dropado.isna().any(axis=1)


missing = posts_from_users_dropado.loc[mask,:]

missing

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
0,-1,1,0,5007,1920,2175,0,,0,<p><strong>CrossValidated</strong> is for stat...
1,-1,1,0,5007,1920,8576,0,,0,
2,-1,1,0,5007,1920,8578,0,,0,
3,-1,1,0,5007,1920,8981,0,,0,"<p>""Statistics"" can refer variously to the (wi..."
4,-1,1,0,5007,1920,8982,0,,0,This generic tag is only rarely suitable; use ...
...,...,...,...,...,...,...,...,...,...,...
90817,55605,1,2,0,0,115106,0,,0,"<p>Recasting this as a time-to-event problem, ..."
90820,55609,1,1,0,0,115115,2,,0,"<p>This is my favourite:</p>\n\n<p>""To be sure..."
90827,55621,1,1,0,0,115213,0,,0,<p>Here is the part that explains answer to yo...
90835,55637,26,4,0,0,115170,1,,0,"<p>When you say class, I hope you mean 'output..."
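
The mask used above combines `isna()` with `any(axis=1)`, which evaluates each row across its columns. A minimal sketch on an invented frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "post_id": [1, 2, 3, 4],
    "ViewCount": [10.0, np.nan, 30.0, np.nan],
    "Body": ["a", "b", None, None],
})

# isna() gives a boolean frame; any(axis=1) is True for rows with at least one NaN.
row_has_missing = toy.isna().any(axis=1)
rows_with_missing = toy.loc[row_has_missing, :]

# Counting per column shows where the gaps come from.
missing_per_column = toy.isna().sum()
print(rows_with_missing)
print(missing_per_column)
```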


## You will need to do something about the missing values. Will you drop them or fill them in?

Pay attention. There can be different reasons for the missing values. Look at the `user_id` of some of them, and look at the body of the message. For which ones are you sure what the value should be, and which ones can you only infer? Don't hurry, take a good look at your data.

In [31]:
mask_1 = (missing.CommentCount == 0)
mask_2 = (missing.CommentCount > 0)

In [32]:
missing.loc[mask_2,:].sample(10)

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
9196,805,65272,5680,7035,143,59790,1,,2,"<p>If I understand it right, Mallows distance ..."
32951,6961,2413,266,152,11,58983,2,,2,"<p>Under a simple null hypothesis, the samplin..."
25938,4505,25123,1245,582,4,45396,4,,1,<p>For a single covariance you only need the b...
77160,35917,1423,123,94,2,84117,1,,5,<p>I'm not sure there is ever a reason to use ...
4863,442,6431,973,857,21,2780,1,,1,<p>Normally you always find ways to convert ef...
37528,8013,7972,662,108,6,96631,1,,8,"<p>To be formal about it, there are very abstr..."
74714,32036,7227,991,2664,143,80929,0,,2,<p><strong>Edit: might not apply</strong> beca...
27160,4843,543,102,32,0,27234,1,,1,<p>A method such as that used in unsupervised ...
5741,582,2400,293,692,23,14801,0,,1,<p>You could use <code>image.smooth</code> in ...
40259,9007,1318,309,456,31,27220,7,,7,<p>The answer is almost always: <strong>report...


In [33]:
missing.sort_values(by='CommentCount', ascending=False)

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
44796,11032,22275,7395,2619,42,31038,6,,45,<p>Consistency of an estimator means that as t...
13987,1124,3000,456,140,7,2365,5,,41,<p>The problem starts with your sentence :</p>...
79749,38102,208,28,43,0,92246,-5,,41,<p>Any paper that disproves the nil null hypot...
8604,795,5606,967,1547,8,6605,30,,37,"<p>for some reason, people have difficulty gra..."
27363,4856,17791,5927,2122,412,30160,12,,35,<p>It is true that each element of a multivari...
...,...,...,...,...,...,...,...,...,...,...
31952,6633,9429,1534,167,88,43705,8,,0,<p>The calculation of such probabilities has b...
31949,6633,9429,1534,167,88,40567,2,,0,"<p>Yes, the log likelihood ratio <em>must</em>..."
31935,6633,9429,1534,167,88,27653,2,,0,"<p>Speaking from an engineering viewpoint, the..."
31934,6633,9429,1534,167,88,27487,1,,0,<p>For <em>any</em> random variables $X$ and $...


-> Answer:\
I'm sure that the missing `ViewCount` values in rows whose `CommentCount` is 0 must be filled in with zero.
What I could infer is that, since there are no zero values in this column, `NaN` must stand for zero.

## Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [34]:
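# All the missing values from the ViewCount column are filled in with zero

# select only the rows where ViewCount itself is missing, so that rows missing just Body keep their ViewCount
mask = posts_from_users_dropado['ViewCount'].isna()
posts_from_users_dropado.loc[mask,'ViewCount'] = 0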

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
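
This warning most likely appears because the dataframe being modified came out of `drop_duplicates()`, which pandas may track as a possible copy of the original. A common way to avoid it is to take an explicit `.copy()` before assigning. The sketch below uses an invented frame, and the names `toy` and `clean` are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "post_id": [1, 1, 2, 3],
    "ViewCount": [10.0, 10.0, np.nan, np.nan],
    "Body": ["a", "a", None, "c"],
})

# An explicit copy makes `clean` an independent object, so assignments are unambiguous.
clean = toy.drop_duplicates().copy()

# Fill each column using a mask built from that column only.
clean.loc[clean["ViewCount"].isna(), "ViewCount"] = 0
clean.loc[clean["Body"].isna(), "Body"] = "empty"

print(clean)
```

The original `toy` frame still contains its missing values, which confirms that only the copy was modified.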


In [35]:
posts_from_users_dropado

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
0,-1,1,0,5007,1920,2175,0,0.0,0,<p><strong>CrossValidated</strong> is for stat...
1,-1,1,0,5007,1920,8576,0,0.0,0,
2,-1,1,0,5007,1920,8578,0,0.0,0,
3,-1,1,0,5007,1920,8981,0,0.0,0,"<p>""Statistics"" can refer variously to the (wi..."
4,-1,1,0,5007,1920,8982,0,0.0,0,This generic tag is only rarely suitable; use ...
...,...,...,...,...,...,...,...,...,...,...
90878,55734,1,0,0,0,115352,0,16.0,0,"<p>For example, I was looking at <a href=""http..."
90879,55738,11,0,0,0,115360,2,40.0,4,<p>Is Student's t test a Wald test?</p>\n\n<p>...
90880,55742,6,0,0,0,115366,1,17.0,0,<p>Does any standard statistical software like...
90881,55744,6,1,0,0,115370,1,13.0,2,<p>im analyzing an article for my studies with...


In [36]:
# All the missing values from the Body column are filled in with the placeholder string 'empty'

mask = posts_from_users_dropado['Body'].isna()
posts_from_users_dropado.loc[mask,'Body'] = 'empty'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


In [37]:
posts_from_users_dropado

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
0,-1,1,0,5007,1920,2175,0,0.0,0,<p><strong>CrossValidated</strong> is for stat...
1,-1,1,0,5007,1920,8576,0,0.0,0,empty
2,-1,1,0,5007,1920,8578,0,0.0,0,empty
3,-1,1,0,5007,1920,8981,0,0.0,0,"<p>""Statistics"" can refer variously to the (wi..."
4,-1,1,0,5007,1920,8982,0,0.0,0,This generic tag is only rarely suitable; use ...
...,...,...,...,...,...,...,...,...,...,...
90878,55734,1,0,0,0,115352,0,16.0,0,"<p>For example, I was looking at <a href=""http..."
90879,55738,11,0,0,0,115360,2,40.0,4,<p>Is Student's t test a Wald test?</p>\n\n<p>...
90880,55742,6,0,0,0,115366,1,17.0,0,<p>Does any standard statistical software like...
90881,55744,6,1,0,0,115370,1,13.0,2,<p>im analyzing an article for my studies with...
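
On the data type question: `ViewCount` is stored as a float only because `NaN` forced that dtype, so once the gaps are filled it can become an integer column. The sketch below shows this on invented values, plus the nullable `Int64` dtype as an alternative when the gaps are kept. Which option suits the real data is a judgment call.

```python
import pandas as pd

# After filling, every value is a whole number, so a plain integer dtype works.
filled = pd.Series([1278.0, 0.0, 16.0], name="ViewCount")
as_int = filled.astype("int64")

# If missing values are kept, the nullable 'Int64' dtype stores integers alongside <NA>.
with_gaps = pd.Series([1278.0, None, 16.0], name="ViewCount")
as_nullable = with_gaps.astype("Int64")

print(as_int.dtype, as_nullable.dtype)
```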


# Bonus 1: (filtering) What is the average number of comments for users who are above the average reputation?

*Hint:* Calculate the average of the user Reputation. Store it in a variable called `avg_reputation` and then use that variable for filtering the dataset and generating the results for each case (the case in which `Reputation > {avg_reputation}`, and so on).

*Hint 2:* You could create a variable based on that condition and use the group by function to perform the task above.

In [38]:
avg_reputation = posts_from_users_dropado.Reputation.mean()
avg_reputation

6282.395411993288

In [39]:
mask = posts_from_users_dropado.Reputation > avg_reputation
posts_from_users_dropado.loc[mask,:]

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
211,5,6792,1145,662,5,6,152,29229.0,5,"<p>Last year, I read a blog post from <a href=..."
212,5,6792,1145,662,5,12,20,0.0,1,"<p>See my response to <a href=""http://stackove..."
213,5,6792,1145,662,5,32,12,0.0,0,"<p>I recommend R (see <a href=""http://cran.r-p..."
214,5,6792,1145,662,5,49,6,0.0,0,<p>You don't need to install any packages beca...
215,5,6792,1145,662,5,64,6,0.0,0,"<p>Yes, there are many methods. You would nee..."
...,...,...,...,...,...,...,...,...,...,...
74959,32036,7227,991,2664,143,111822,3,0.0,2,<p>I would work on trying to get the bifactor ...
74960,32036,7227,991,2664,143,113074,1,0.0,0,<p>The Mann–Whitney–Wilcoxon test is the simpl...
74961,32036,7227,991,2664,143,113379,2,0.0,0,<p>Categorical variables have finite sets of d...
74962,32036,7227,991,2664,143,113527,2,0.0,7,<p>No. Age is a continuous variable and should...


In [40]:
# The average number of comments for users above avg reputation is about 2.09
mask = posts_from_users_dropado.Reputation > avg_reputation
posts_from_users_dropado.loc[mask,['CommentCount']].mean()

CommentCount    2.087689
dtype: float64
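
Hint 2 can be sketched as follows: build a boolean column from the condition and group by it, which gives the averages for both cases at once. The data is invented and the column name `above_avg` is an illustrative assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    "Reputation": [1, 1, 100, 200],
    "CommentCount": [0, 2, 4, 6],
})

avg_reputation = toy["Reputation"].mean()

# Flag each row, then average CommentCount within each flag value.
toy["above_avg"] = toy["Reputation"] > avg_reputation
by_group = toy.groupby("above_avg")["CommentCount"].mean()
print(by_group)
```

Note that in a merged posts-per-user table each row is a post, so both the reputation average and the comment average are weighted by how many posts each user has.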

# Bonus 2: (grouping) Group your dataframe by the Reputation of your user. Calculate the mean value of ViewCount and CommentCount for each reputation value.

Suppose the missing values in ViewCount are due to a systemic error: the system abended, and you want to guess what values should have been there in the first place.

Would the per-reputation mean be an interesting candidate for imputing the missing `ViewCount` values? If so, impute them with these values.

In [41]:
# The average CommentCount and ViewCount per reputation value, for users above avg reputation:
mask = posts_from_users_dropado.Reputation > avg_reputation

posts_from_users_dropado.loc[mask,['Reputation','CommentCount','ViewCount']].groupby(by='Reputation',as_index=False).mean()

Unnamed: 0,Reputation,CommentCount,ViewCount
0,6431,2.206107,574.564885
1,6461,1.880952,3.321429
2,6524,2.307692,51.582418
3,6764,2.115702,738.0
4,6792,1.581197,1292.91453
5,6906,1.965116,46.145349
6,7227,1.643123,1.319703
7,7246,2.337423,33.993865
8,7461,1.781746,28.833333
9,7663,1.846715,15.587591


Would that be an interesting candidate for imputing the missing ViewCount values? If so, impute them with these values.

In [64]:
new_posts_users = posts_from_users.drop_duplicates()
new_posts_users

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
0,-1,1,0,5007,1920,2175,0,,0,<p><strong>CrossValidated</strong> is for stat...
1,-1,1,0,5007,1920,8576,0,,0,
2,-1,1,0,5007,1920,8578,0,,0,
3,-1,1,0,5007,1920,8981,0,,0,"<p>""Statistics"" can refer variously to the (wi..."
4,-1,1,0,5007,1920,8982,0,,0,This generic tag is only rarely suitable; use ...
...,...,...,...,...,...,...,...,...,...,...
90878,55734,1,0,0,0,115352,0,16.0,0,"<p>For example, I was looking at <a href=""http..."
90879,55738,11,0,0,0,115360,2,40.0,4,<p>Is Student's t test a Wald test?</p>\n\n<p>...
90880,55742,6,0,0,0,115366,1,17.0,0,<p>Does any standard statistical software like...
90881,55744,6,1,0,0,115370,1,13.0,2,<p>im analyzing an article for my studies with...


In [71]:
# average viewcount per reputation

# above average:
mask_aavg = new_posts_users.Reputation > avg_reputation

# below average:
mask_bavg = new_posts_users.Reputation < avg_reputation

# all missing values in the ViewCount column:
mask2 = new_posts_users.ViewCount.isna()

# two fill values for the missing entries in the ViewCount column, one per reputation category:
abv_avg = new_posts_users.loc[mask_aavg,'ViewCount'].mean()
blw_avg = new_posts_users.loc[mask_bavg,'ViewCount'].mean()

In [65]:
# fill the missing ViewCount values for reputations above average:

new_posts_users.loc[(mask2 & mask_aavg),'ViewCount'] = abv_avg 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


In [66]:
new_posts_users.loc[(mask2 & mask_bavg),'ViewCount'] = blw_avg

In [70]:
new_posts_users

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
0,-1,1,0,5007,1920,2175,0,541.941608,0,<p><strong>CrossValidated</strong> is for stat...
1,-1,1,0,5007,1920,8576,0,541.941608,0,
2,-1,1,0,5007,1920,8578,0,541.941608,0,
3,-1,1,0,5007,1920,8981,0,541.941608,0,"<p>""Statistics"" can refer variously to the (wi..."
4,-1,1,0,5007,1920,8982,0,541.941608,0,This generic tag is only rarely suitable; use ...
...,...,...,...,...,...,...,...,...,...,...
90878,55734,1,0,0,0,115352,0,16.000000,0,"<p>For example, I was looking at <a href=""http..."
90879,55738,11,0,0,0,115360,2,40.000000,4,<p>Is Student's t test a Wald test?</p>\n\n<p>...
90880,55742,6,0,0,0,115366,1,17.000000,0,<p>Does any standard statistical software like...
90881,55744,6,1,0,0,115370,1,13.000000,2,<p>im analyzing an article for my studies with...
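
The cells above impute with two values, one per reputation category. A finer-grained alternative, closer to the grouping in the exercise statement, is to impute with the mean of each individual reputation value using `groupby(...).transform`. This is a sketch on invented data, and the column name `ViewCount_imputed` is an illustrative assumption.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Reputation": [1, 1, 1, 50, 50, 99],
    "ViewCount": [10.0, np.nan, 30.0, 100.0, np.nan, np.nan],
})

# transform('mean') returns one value per row: the mean ViewCount of that row's Reputation group.
group_mean = toy.groupby("Reputation")["ViewCount"].transform("mean")

# Fill only the missing entries; observed values are left untouched.
toy["ViewCount_imputed"] = toy["ViewCount"].fillna(group_mean)
print(toy)
```

A group with no observed `ViewCount` at all (reputation 99 here) stays missing, so a fallback such as the overall mean would still be needed for those rows.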


## refs

Sample database used: https://relational.fit.cvut.cz/dataset/Stats

Stack Overflow database: https://www.brentozar.com/archive/2015/10/how-to-download-the-stack-overflow-database-via-bittorrent/
