![Ironhack logo](https://i.imgur.com/1QgrNNw.png)

# Lab | Data Cleaning

## Introduction

A commonly repeated claim is that 80% of a data scientist's work is data cleaning. We have no idea whether this number is accurate, but data scientists do spend a lot of time and effort collecting, cleaning and preparing data for analysis, because datasets are usually messy and complex. Being able to refine and restructure a dataset into a usable state is an essential skill before moving on to the data analysis stage.

In this exercise, you will practice the data cleaning techniques we discussed in the lesson and learn new ones by looking up documentation and references. You will work on your own, but remember that the teaching staff is there to help whenever you run into problems.

## Getting Started

Read the instructions for each cell and provide your answers. Make sure to test your answer in each cell and save it. Jupyter Notebook should save your progress automatically, but it's a good idea to save manually from time to time, just in case.


## Resources

[Data Cleaning Tutorial](https://www.tutorialspoint.com/python/python_data_cleansing.html)

[Data Cleaning with Numpy and Pandas](https://realpython.com/python-data-cleaning-numpy-pandas/#python-data-cleaning-recap-and-resources)

[Data Cleaning Video](https://www.youtube.com/watch?v=ZOX18HfLHGQ)

[Data Preparation](https://www.kdnuggets.com/2017/06/7-steps-mastering-data-preparation-python.html)

[Google Search](https://www.google.es/search?q=how+to+clean+data+with+python)

# Data Cleaning 

In [1]:
import pandas as pd

# Read the users dataset.

Open `users.csv` and check which separator it uses.

In [2]:
users=pd.read_csv('../data/users.csv', sep='#')
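If you would rather not open the file by hand, one rough way to guess the separator is to count candidate characters in the header line. The sketch below uses a hypothetical three-line sample in place of the real file, and the list of candidate separators is an assumption, not an exhaustive one.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the first lines of users.csv.
sample = "Id#Reputation#Views\n-1#1#0\n2#101#25\n"

# Count how often each candidate separator appears in the header line.
header = sample.splitlines()[0]
candidates = [",", ";", "\t", "#", "|"]
counts = {sep: header.count(sep) for sep in candidates}
best_sep = max(counts, key=counts.get)

# Read the sample using the separator that appeared most often.
demo = pd.read_csv(io.StringIO(sample), sep=best_sep)
print(best_sep, demo.shape)  # prints: # (2, 3)
```

This heuristic can be fooled by separators that also appear inside column names, so treat the result as a guess to confirm with `.head()`.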

## Check its shape

See the number of rows and columns you're dealing with.

In [3]:
users.shape

(40503, 14)

## Use the .head() to see some rows of your dataframe.

In [4]:
users.head()

Unnamed: 0,Id,Reputation,CreationDate,DisplayName,LastAccessDate,WebsiteUrl,Location,AboutMe,Views,UpVotes,DownVotes,AccountId,Age,ProfileImageUrl
0,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,1920,-1,,
1,2,101,2010-07-19 14:01:36,Geoff Dalgas,2013-11-12 22:07:23,http://stackoverflow.com,"Corvallis, OR",<p>Developer on the StackOverflow team. Find ...,25,3,0,2,37.0,
2,3,101,2010-07-19 15:34:50,Jarrod Dixon,2014-08-08 06:42:58,http://stackoverflow.com,"New York, NY","<p><a href=""http://blog.stackoverflow.com/2009...",22,19,0,3,35.0,
3,4,101,2010-07-19 19:03:27,Emmett,2014-01-02 09:31:02,http://minesweeperonline.com,"San Francisco, CA",<p>currently at a startup in SF</p>\n\n<p>form...,11,0,0,1998,28.0,http://i.stack.imgur.com/d1oHX.jpg
4,5,6792,2010-07-19 19:03:57,Shane,2014-08-13 00:23:47,http://www.statalgo.com,"New York, NY",<p>Quantitative researcher focusing on statist...,1145,662,5,54503,35.0,


## Get the data info. 

Which columns have a large number of missing values? How much memory does this dataframe occupy?

Expected output:
````
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40503 entries, 0 to 40502
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Id               40503 non-null  int64  
 1   Reputation       40503 non-null  int64  
 2   CreationDate     40503 non-null  object 
 3   DisplayName      40497 non-null  object 
 4   LastAccessDate   40503 non-null  object 
 5   WebsiteUrl       8158 non-null   object 
 6   Location         11731 non-null  object 
 7   AboutMe          9424 non-null   object 
 8   Views            40503 non-null  int64  
 9   UpVotes          40503 non-null  int64  
 10  DownVotes        40503 non-null  int64  
 11  AccountId        40503 non-null  int64  
 12  Age              8352 non-null   float64
 13  ProfileImageUrl  16540 non-null  object 
dtypes: float64(1), int64(6), object(7)
memory usage: 4.3+ MB
````

In [5]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40503 entries, 0 to 40502
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Id               40503 non-null  int64  
 1   Reputation       40503 non-null  int64  
 2   CreationDate     40503 non-null  object 
 3   DisplayName      40497 non-null  object 
 4   LastAccessDate   40503 non-null  object 
 5   WebsiteUrl       8158 non-null   object 
 6   Location         11731 non-null  object 
 7   AboutMe          9424 non-null   object 
 8   Views            40503 non-null  int64  
 9   UpVotes          40503 non-null  int64  
 10  DownVotes        40503 non-null  int64  
 11  AccountId        40503 non-null  int64  
 12  Age              8352 non-null   float64
 13  ProfileImageUrl  16540 non-null  object 
dtypes: float64(1), int64(6), object(7)
memory usage: 4.3+ MB


## Rename Id column to user_id.

Remember to store your results back in the dataframe.

In [6]:
users.rename(columns=({'Id':'user_id'}), inplace=True)
users.head()

Unnamed: 0,user_id,Reputation,CreationDate,DisplayName,LastAccessDate,WebsiteUrl,Location,AboutMe,Views,UpVotes,DownVotes,AccountId,Age,ProfileImageUrl
0,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,1920,-1,,
1,2,101,2010-07-19 14:01:36,Geoff Dalgas,2013-11-12 22:07:23,http://stackoverflow.com,"Corvallis, OR",<p>Developer on the StackOverflow team. Find ...,25,3,0,2,37.0,
2,3,101,2010-07-19 15:34:50,Jarrod Dixon,2014-08-08 06:42:58,http://stackoverflow.com,"New York, NY","<p><a href=""http://blog.stackoverflow.com/2009...",22,19,0,3,35.0,
3,4,101,2010-07-19 19:03:27,Emmett,2014-01-02 09:31:02,http://minesweeperonline.com,"San Francisco, CA",<p>currently at a startup in SF</p>\n\n<p>form...,11,0,0,1998,28.0,http://i.stack.imgur.com/d1oHX.jpg
4,5,6792,2010-07-19 19:03:57,Shane,2014-08-13 00:23:47,http://www.statalgo.com,"New York, NY",<p>Quantitative researcher focusing on statist...,1145,662,5,54503,35.0,
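The reminder about storing results matters because `rename` returns a new dataframe unless you pass `inplace=True`. A minimal sketch on a hypothetical two-row dataframe (the name `toy` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy dataframe; not the real users data.
toy = pd.DataFrame({"Id": [1, 2], "Reputation": [101, 6792]})

# Without inplace=True, rename leaves the original untouched.
renamed = toy.rename(columns={"Id": "user_id"})
columns_before = list(toy.columns)    # still ['Id', 'Reputation']
columns_after = list(renamed.columns)  # ['user_id', 'Reputation']

# Reassigning is an alternative to inplace=True.
toy = toy.rename(columns={"Id": "user_id"})
print(columns_before, columns_after, list(toy.columns))
```

Either style works; reassigning makes it easier to chain several steps together.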


# Import the `posts.csv` dataset.

Note that this is a `gzip compressed csv`. In order to read this file correctly, you'll have to read the documentation (or help) of your `pd.read_csv()` function and check the `compression` argument. Try to understand which value of `compression=...` you should put in order to read your dataframe. 

In [7]:
posts = pd.read_csv('../data/posts.csv.gzip', compression ='gzip')
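To see the `compression` argument at work without the real file, the sketch below writes a tiny gzip-compressed CSV to a temporary folder and reads it back. The dataframe `toy_posts` and the file name are hypothetical. As far as we know, the default `compression='infer'` looks at the file extension and recognises `.gz` but not `.gzip`, which would be one reason to state the value explicitly here.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data standing in for posts.
toy_posts = pd.DataFrame({"Id": [1, 2, 3], "Score": [23, 22, 54]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_posts.csv.gzip")
    # Write a gzip-compressed CSV, naming the compression explicitly.
    toy_posts.to_csv(path, index=False, compression="gzip")
    # Read it back, again naming the compression explicitly.
    restored = pd.read_csv(path, compression="gzip")

print(restored.equals(toy_posts))  # prints: True
```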

## Perform the same as above to understand a bit of your data (head, info, shape)

In [8]:
posts.head()

Unnamed: 0,Id,PostTypeId,AcceptedAnswerId,CreaionDate,Score,ViewCount,Body,OwnerUserId,LasActivityDate,Title,...,AnswerCount,CommentCount,FavoriteCount,LastEditorUserId,LastEditDate,CommunityOwnedDate,ParentId,ClosedDate,OwnerDisplayName,LastEditorDisplayName
0,1,1,15.0,2010-07-19 19:12:12,23,1278.0,<p>How should I elicit prior distributions fro...,8.0,2010-09-15 21:08:26,Eliciting priors from experts,...,5.0,1,14.0,,,,,,,
1,2,1,59.0,2010-07-19 19:12:57,22,8198.0,<p>In many different statistical methods there...,24.0,2012-11-12 09:21:54,What is normality?,...,7.0,1,8.0,88.0,2010-08-07 17:56:44,,,,,
2,3,1,5.0,2010-07-19 19:13:28,54,3613.0,<p>What are some valuable Statistical Analysis...,18.0,2013-05-27 14:48:36,What are some valuable Statistical Analysis op...,...,19.0,4,36.0,183.0,2011-02-12 05:50:03,2010-07-19 19:13:28,,,,
3,4,1,135.0,2010-07-19 19:13:31,13,5224.0,<p>I have two groups of data. Each with a dif...,23.0,2010-09-08 03:00:19,Assessing the significance of differences in d...,...,5.0,2,2.0,,,,,,,
4,5,2,,2010-07-19 19:14:43,81,,"<p>The R-project</p>\n\n<p><a href=""http://www...",23.0,2010-07-19 19:21:15,,...,,3,,23.0,2010-07-19 19:21:15,2010-07-19 19:14:43,3.0,,,


In [9]:
posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91976 entries, 0 to 91975
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Id                     91976 non-null  int64  
 1   PostTypeId             91976 non-null  int64  
 2   AcceptedAnswerId       14700 non-null  float64
 3   CreaionDate            91976 non-null  object 
 4   Score                  91976 non-null  int64  
 5   ViewCount              42921 non-null  float64
 6   Body                   91756 non-null  object 
 7   OwnerUserId            90584 non-null  float64
 8   LasActivityDate        91976 non-null  object 
 9   Title                  42921 non-null  object 
 10  Tags                   42921 non-null  object 
 11  AnswerCount            42921 non-null  float64
 12  CommentCount           91976 non-null  int64  
 13  FavoriteCount          13246 non-null  float64
 14  LastEditorUserId       44611 non-null  float64
 15  La

In [10]:
posts.shape

(91976, 21)

## Rename Id column to post_id and OwnerUserId to user_id.

Again, remember to check that your results are correctly stored inside the dataframe.

In [11]:
posts.rename(columns=({'Id' : 'post_id', 'OwnerUserId' : 'user_id'}), inplace=True)
posts.head()

Unnamed: 0,post_id,PostTypeId,AcceptedAnswerId,CreaionDate,Score,ViewCount,Body,user_id,LasActivityDate,Title,...,AnswerCount,CommentCount,FavoriteCount,LastEditorUserId,LastEditDate,CommunityOwnedDate,ParentId,ClosedDate,OwnerDisplayName,LastEditorDisplayName
0,1,1,15.0,2010-07-19 19:12:12,23,1278.0,<p>How should I elicit prior distributions fro...,8.0,2010-09-15 21:08:26,Eliciting priors from experts,...,5.0,1,14.0,,,,,,,
1,2,1,59.0,2010-07-19 19:12:57,22,8198.0,<p>In many different statistical methods there...,24.0,2012-11-12 09:21:54,What is normality?,...,7.0,1,8.0,88.0,2010-08-07 17:56:44,,,,,
2,3,1,5.0,2010-07-19 19:13:28,54,3613.0,<p>What are some valuable Statistical Analysis...,18.0,2013-05-27 14:48:36,What are some valuable Statistical Analysis op...,...,19.0,4,36.0,183.0,2011-02-12 05:50:03,2010-07-19 19:13:28,,,,
3,4,1,135.0,2010-07-19 19:13:31,13,5224.0,<p>I have two groups of data. Each with a dif...,23.0,2010-09-08 03:00:19,Assessing the significance of differences in d...,...,5.0,2,2.0,,,,,,,
4,5,2,,2010-07-19 19:14:43,81,,"<p>The R-project</p>\n\n<p><a href=""http://www...",23.0,2010-07-19 19:21:15,,...,,3,,23.0,2010-07-19 19:21:15,2010-07-19 19:14:43,3.0,,,


## Define new dataframes for users and posts with the following selected columns:

**users columns**: user_id, Reputation, Views, UpVotes, DownVotes  
**posts columns**: post_id, Score, user_id, ViewCount, CommentCount, Body

In [12]:
users_columns = ['user_id', 'Reputation', 'Views', 'UpVotes', 'DownVotes']

new_users = users.loc[:,users_columns]
new_users.head()

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes
0,-1,1,0,5007,1920
1,2,101,25,3,0
2,3,101,22,19,0
3,4,101,11,0,0
4,5,6792,1145,662,5


In [13]:
posts_columns = ['post_id', 'Score', 'user_id', 'ViewCount', 'CommentCount', 'Body']

new_posts = posts.loc[:, posts_columns]
new_posts.head()

Unnamed: 0,post_id,Score,user_id,ViewCount,CommentCount,Body
0,1,23,8.0,1278.0,1,<p>How should I elicit prior distributions fro...
1,2,22,24.0,8198.0,1,<p>In many different statistical methods there...
2,3,54,18.0,3613.0,4,<p>What are some valuable Statistical Analysis...
3,4,13,23.0,5224.0,2,<p>I have two groups of data. Each with a dif...
4,5,81,23.0,,3,"<p>The R-project</p>\n\n<p><a href=""http://www..."


**Note:** Check the info of the new posts dataframe. What is the most noticeable change?

Explain, in terms of efficiency, why we selected only some of its columns.

In [14]:
posts.info()        # memory usage: 14.7+ MB
new_posts.info()    # memory usage:  4.2+ MB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91976 entries, 0 to 91975
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   post_id                91976 non-null  int64  
 1   PostTypeId             91976 non-null  int64  
 2   AcceptedAnswerId       14700 non-null  float64
 3   CreaionDate            91976 non-null  object 
 4   Score                  91976 non-null  int64  
 5   ViewCount              42921 non-null  float64
 6   Body                   91756 non-null  object 
 7   user_id                90584 non-null  float64
 8   LasActivityDate        91976 non-null  object 
 9   Title                  42921 non-null  object 
 10  Tags                   42921 non-null  object 
 11  AnswerCount            42921 non-null  float64
 12  CommentCount           91976 non-null  int64  
 13  FavoriteCount          13246 non-null  float64
 14  LastEditorUserId       44611 non-null  float64
 15  La
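The `+` in the `memory usage` line means the figure is a lower bound: text (`object`) columns are not measured in depth by default. A sketch of how to compare a full dataframe with a column subset using `memory_usage(deep=True)`, on hypothetical data (the column contents below are invented):

```python
import pandas as pd

# Hypothetical dataframe with two numeric and two text columns.
toy = pd.DataFrame({
    "post_id": list(range(1000)),
    "Score": [1] * 1000,
    "Body": ["<p>some long text</p>" * 20] * 1000,
    "Title": ["a title"] * 1000,
})

# Keep only the numeric columns, as a stand-in for a column selection.
subset = toy.loc[:, ["post_id", "Score"]]

# deep=True also measures the strings stored in text columns.
full_bytes = int(toy.memory_usage(deep=True).sum())
subset_bytes = int(subset.memory_usage(deep=True).sum())
print(full_bytes, subset_bytes)
```

Dropping long text columns usually saves far more memory than dropping numeric ones, which is why the gap between the two numbers is so large here.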

# Merge the new dataframes you have created, of users and posts. Create a dataframe called `posts_from_users`

You will need to make an inner [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) of the posts and users dataframes.

Think carefully about which key(s) you should merge on.

In [15]:
posts_from_users = pd.merge(left=new_users, right=new_posts, on='user_id')
posts_from_users.head()

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
0,-1,1,0,5007,1920,2175,0,,0,<p><strong>CrossValidated</strong> is for stat...
1,-1,1,0,5007,1920,8576,0,,0,
2,-1,1,0,5007,1920,8578,0,,0,
3,-1,1,0,5007,1920,8981,0,,0,"<p>""Statistics"" can refer variously to the (wi..."
4,-1,1,0,5007,1920,8982,0,,0,This generic tag is only rarely suitable; use ...


In [16]:
posts_from_users.shape

(90883, 10)
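An inner merge keeps only the keys found in both dataframes, so the row count can change in ways that are easy to miss. The sketch below uses hypothetical toy dataframes: one user has no posts and one post has no `user_id`. An outer merge with `indicator=True` is one way to see which rows the inner merge leaves out.

```python
import pandas as pd

# Hypothetical toy data: user 3 has no posts, post 103 has no owner.
toy_users = pd.DataFrame({"user_id": [1, 2, 3], "Reputation": [10, 20, 30]})
toy_posts = pd.DataFrame({
    "post_id": [100, 101, 102, 103],
    "user_id": [1.0, 1.0, 2.0, None],
})

# Inner merge: only rows whose user_id appears on both sides survive.
inner = pd.merge(toy_users, toy_posts, on="user_id", how="inner")

# Outer merge with an indicator column shows where each row came from.
outer = pd.merge(toy_users, toy_posts, on="user_id", how="outer", indicator=True)
origin_counts = outer["_merge"].value_counts().to_dict()

print(len(inner))
print(origin_counts)
```

Comparing `len(inner)` with the row counts of the two inputs is a quick sanity check after any merge.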

## Check the number of duplicated rows.

Remember that you can sum a boolean mask to count how many times `True` appears in it. This works because `True` is interpreted as `1` in Python, whereas `False` is interpreted as `0`.

In [17]:
posts_from_users.duplicated()

0        False
1        False
2        False
3        False
4        False
         ...  
90878    False
90879    False
90880    False
90881    False
90882    False
Length: 90883, dtype: bool

In [18]:
posts_from_users.duplicated().sum() 

299
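The same counting trick on a hypothetical five-row dataframe, small enough to check by eye:

```python
import pandas as pd

# Hypothetical toy data with two repeated rows.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "post_id": [10, 10, 20, 21, 21],
})

# True marks a row that repeats an earlier row.
dup_mask = toy.duplicated()

# Summing the mask counts the True values.
n_duplicates = int(dup_mask.sum())
print(dup_mask.tolist(), n_duplicates)
# prints: [False, True, False, False, True] 2
```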

## Find those duplicate values and try to understand what happened.

*Hint:* You can pass `keep=False` to the `.duplicated()` method to mark every copy of a duplicated row, not only the repeats.

*Hint 2:* You can sort the values `by=['user_id', 'post_id']` to see them in order.


keep : {'first', 'last', False}, default 'first'
    Determines which duplicates (if any) to mark.

    - ``first`` : Mark duplicates as ``True`` except for the first occurrence.
    - ``last`` : Mark duplicates as ``True`` except for the last occurrence.
    - False : Mark all duplicates as ``True``.

In [19]:
# To visualize duplicated values in dataframe

mask = posts_from_users.duplicated(keep=False)   # keep=False --> mark all duplicates as True
posts_from_users.loc[mask,:].sort_values(by=['user_id', 'post_id'])

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
8396,760,168,13,13,0,1289,7,1139.0,8,<p>I am having difficulties to select the righ...
8399,760,168,13,13,0,1289,7,1139.0,8,<p>I am having difficulties to select the righ...
8397,760,168,13,13,0,8625,6,1799.0,3,<p>I was fiddling with PCA and LDA methods and...
8400,760,168,13,13,0,8625,6,1799.0,3,<p>I was fiddling with PCA and LDA methods and...
8398,760,168,13,13,0,23987,0,62.0,3,<p>I was studying on a PAMI article and I have...
...,...,...,...,...,...,...,...,...,...,...
90343,54711,4,18,0,0,114527,0,45.0,5,<p>From Shapiro-Wilk's test I see that the res...
90368,54741,16,1,0,0,113334,3,122.0,9,<p>I am confused on what I have read about the...
90369,54741,16,1,0,0,113334,3,122.0,9,<p>I am confused on what I have read about the...
90460,54911,1,1,0,0,113691,0,36.0,11,<p>I extract data related to a movie by sentim...


## Should you drop them? If you think it is reasonable to drop them, then drop them.

Think: what went wrong in the first place, and how could you have prevented it?

*Hint:* There's a pandas method to drop duplicates. If you wanted to do it by hand, you could select the indexes of the duplicated rows and `.drop()` them.

In [20]:
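# Note: drop_duplicates() returns a new dataframe and leaves posts_from_users unchanged;
# assign the result back if you want to keep the deduplicated version.
posts_from_users.drop_duplicates()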

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
0,-1,1,0,5007,1920,2175,0,,0,<p><strong>CrossValidated</strong> is for stat...
1,-1,1,0,5007,1920,8576,0,,0,
2,-1,1,0,5007,1920,8578,0,,0,
3,-1,1,0,5007,1920,8981,0,,0,"<p>""Statistics"" can refer variously to the (wi..."
4,-1,1,0,5007,1920,8982,0,,0,This generic tag is only rarely suitable; use ...
...,...,...,...,...,...,...,...,...,...,...
90878,55734,1,0,0,0,115352,0,16.0,0,"<p>For example, I was looking at <a href=""http..."
90879,55738,11,0,0,0,115360,2,40.0,4,<p>Is Student's t test a Wald test?</p>\n\n<p>...
90880,55742,6,0,0,0,115366,1,17.0,0,<p>Does any standard statistical software like...
90881,55744,6,1,0,0,115370,1,13.0,2,<p>im analyzing an article for my studies with...


## How many missing values do you have in your merged dataframe? On which columns?

In [21]:
posts_from_users.isna().sum()

user_id             0
Reputation          0
Views               0
UpVotes             0
DownVotes           0
post_id             0
Score               0
ViewCount       48545
CommentCount        0
Body              220
dtype: int64

## Select only the rows in which there is at least one missing value.

In [22]:
posts_from_users.isna().any(axis=1)  

0         True
1         True
2         True
3         True
4         True
         ...  
90878    False
90879    False
90880    False
90881    False
90882    False
Length: 90883, dtype: bool

In [23]:
mask = posts_from_users.isna().any(axis=1)
posts_from_users.loc[mask,:]

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
0,-1,1,0,5007,1920,2175,0,,0,<p><strong>CrossValidated</strong> is for stat...
1,-1,1,0,5007,1920,8576,0,,0,
2,-1,1,0,5007,1920,8578,0,,0,
3,-1,1,0,5007,1920,8981,0,,0,"<p>""Statistics"" can refer variously to the (wi..."
4,-1,1,0,5007,1920,8982,0,,0,This generic tag is only rarely suitable; use ...
...,...,...,...,...,...,...,...,...,...,...
90817,55605,1,2,0,0,115106,0,,0,"<p>Recasting this as a time-to-event problem, ..."
90820,55609,1,1,0,0,115115,2,,0,"<p>This is my favourite:</p>\n\n<p>""To be sure..."
90827,55621,1,1,0,0,115213,0,,0,<p>Here is the part that explains answer to yo...
90835,55637,26,4,0,0,115170,1,,0,"<p>When you say class, I hope you mean 'output..."


In [24]:
mask_both_null = (posts_from_users['ViewCount'].isna()) & (posts_from_users['Body'].isna())
mask_any_null  = (posts_from_users['ViewCount'].isna()) | (posts_from_users['Body'].isna())

In [25]:
posts_from_users.loc[mask_both_null,:]

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
1,-1,1,0,5007,1920,8576,0,,0,
2,-1,1,0,5007,1920,8578,0,,0,
5,-1,1,0,5007,1920,9857,0,,0,
7,-1,1,0,5007,1920,9860,0,,0,
8,-1,1,0,5007,1920,10130,0,,0,
...,...,...,...,...,...,...,...,...,...,...
34605,7290,37083,5554,8641,125,72983,0,,0,
34620,7290,37083,5554,8641,125,76603,0,,0,
34622,7290,37083,5554,8641,125,76631,0,,0,
34702,7290,37083,5554,8641,125,90803,0,,0,


In [26]:
posts_from_users.loc[mask_any_null,:]

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
0,-1,1,0,5007,1920,2175,0,,0,<p><strong>CrossValidated</strong> is for stat...
1,-1,1,0,5007,1920,8576,0,,0,
2,-1,1,0,5007,1920,8578,0,,0,
3,-1,1,0,5007,1920,8981,0,,0,"<p>""Statistics"" can refer variously to the (wi..."
4,-1,1,0,5007,1920,8982,0,,0,This generic tag is only rarely suitable; use ...
...,...,...,...,...,...,...,...,...,...,...
90817,55605,1,2,0,0,115106,0,,0,"<p>Recasting this as a time-to-event problem, ..."
90820,55609,1,1,0,0,115115,2,,0,"<p>This is my favourite:</p>\n\n<p>""To be sure..."
90827,55621,1,1,0,0,115213,0,,0,<p>Here is the part that explains answer to yo...
90835,55637,26,4,0,0,115170,1,,0,"<p>When you say class, I hope you mean 'output..."
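The three views of missing data used above (per column, any column in a row, all selected columns in a row) can be compared on a hypothetical four-row dataframe:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 misses ViewCount, row 2 misses both columns.
toy = pd.DataFrame({
    "post_id": [1, 2, 3, 4],
    "ViewCount": [10.0, np.nan, np.nan, 5.0],
    "Body": ["a", "b", None, "d"],
})

# Missing values per column.
per_column = toy.isna().sum()

# Rows with at least one missing value.
rows_any = toy.loc[toy.isna().any(axis=1), :]

# Rows where both ViewCount and Body are missing.
rows_both = toy.loc[toy[["ViewCount", "Body"]].isna().all(axis=1), :]

print(per_column.to_dict())
print(rows_any["post_id"].tolist(), rows_both["post_id"].tolist())
# second line prints: [2, 3] [3]
```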


## You will need to do something about the missing values. Will you drop them or fill them?

Pay attention: there can be different reasons for the missing values. Look at the `user_id` of some of them, and look at the body of the message. For which ones are you sure what the value should be, and which ones can you only infer? Don't rush; take a good look at your data.

In [27]:
mask= posts_from_users["user_id"] == -1
posts_from_users.loc[mask,:]

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
0,-1,1,0,5007,1920,2175,0,,0,<p><strong>CrossValidated</strong> is for stat...
1,-1,1,0,5007,1920,8576,0,,0,
2,-1,1,0,5007,1920,8578,0,,0,
3,-1,1,0,5007,1920,8981,0,,0,"<p>""Statistics"" can refer variously to the (wi..."
4,-1,1,0,5007,1920,8982,0,,0,This generic tag is only rarely suitable; use ...
...,...,...,...,...,...,...,...,...,...,...
206,-1,1,0,5007,1920,107658,0,,0,
207,-1,1,0,5007,1920,109441,0,,0,
208,-1,1,0,5007,1920,112068,0,,0,
209,-1,1,0,5007,1920,112070,0,,0,


## Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [28]:
posts_from_users.dtypes

user_id           int64
Reputation        int64
Views             int64
UpVotes           int64
DownVotes         int64
post_id           int64
Score             int64
ViewCount       float64
CommentCount      int64
Body             object
dtype: object

In [29]:
posts_from_users['ViewCount'].fillna(0)

0         0.0
1         0.0
2         0.0
3         0.0
4         0.0
         ... 
90878    16.0
90879    40.0
90880    17.0
90881    13.0
90882     5.0
Name: ViewCount, Length: 90883, dtype: float64

In [30]:
posts_from_users['ViewCount'] = posts_from_users['ViewCount'].fillna(0)
posts_from_users

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
0,-1,1,0,5007,1920,2175,0,0.0,0,<p><strong>CrossValidated</strong> is for stat...
1,-1,1,0,5007,1920,8576,0,0.0,0,
2,-1,1,0,5007,1920,8578,0,0.0,0,
3,-1,1,0,5007,1920,8981,0,0.0,0,"<p>""Statistics"" can refer variously to the (wi..."
4,-1,1,0,5007,1920,8982,0,0.0,0,This generic tag is only rarely suitable; use ...
...,...,...,...,...,...,...,...,...,...,...
90878,55734,1,0,0,0,115352,0,16.0,0,"<p>For example, I was looking at <a href=""http..."
90879,55738,11,0,0,0,115360,2,40.0,4,<p>Is Student's t test a Wald test?</p>\n\n<p>...
90880,55742,6,0,0,0,115366,1,17.0,0,<p>Does any standard statistical software like...
90881,55744,6,1,0,0,115370,1,13.0,2,<p>im analyzing an article for my studies with...
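`ViewCount` is a count, yet it is stored as `float64` because a plain integer column cannot hold `NaN`. Two possible ways to get an integer type are sketched below on hypothetical data: fill first and cast to `int64`, or cast to the nullable `Int64` type, which keeps the gaps as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value.
toy = pd.DataFrame({"ViewCount": [np.nan, 16.0, 40.0]})

# Option 1: replace the gaps with 0, then cast to a plain integer type.
as_int = toy["ViewCount"].fillna(0).astype("int64")

# Option 2: nullable integer type, which keeps missing values as <NA>.
nullable = toy["ViewCount"].astype("Int64")

print(as_int.tolist())                                 # prints: [0, 16, 40]
print(str(nullable.dtype), int(nullable.isna().sum()))  # prints: Int64 1
```

Option 2 avoids pretending that an unknown view count is zero, at the cost of a less common dtype.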


# Bonus 1: (filtering) What is the average number of comments for users who are above the average reputation?

*Hint:* Calculate the average of the user Reputation. Store it in a variable called `avg_reputation`, then use that variable to filter the dataset and generate the results for each case (for example, the case in which `Reputation > avg_reputation`).

*Hint 2:* You could create a variable based on that condition and use the group by function to perform the task above.

In [31]:
# Calculate the average of the user Reputation
posts_from_users['Reputation'].mean()

6263.007812242114

In [35]:
# Store it in a variable called avg_reputation
avg_reputation = posts_from_users['Reputation'].mean()

condition = posts_from_users['Reputation'] > avg_reputation
posts_from_users.loc[condition,:]

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
211,5,6792,1145,662,5,6,152,29229.0,5,"<p>Last year, I read a blog post from <a href=..."
212,5,6792,1145,662,5,12,20,0.0,1,"<p>See my response to <a href=""http://stackove..."
213,5,6792,1145,662,5,32,12,0.0,0,"<p>I recommend R (see <a href=""http://cran.r-p..."
214,5,6792,1145,662,5,49,6,0.0,0,<p>You don't need to install any packages beca...
215,5,6792,1145,662,5,64,6,0.0,0,"<p>Yes, there are many methods. You would nee..."
...,...,...,...,...,...,...,...,...,...,...
74959,32036,7227,991,2664,143,111822,3,0.0,2,<p>I would work on trying to get the bifactor ...
74960,32036,7227,991,2664,143,113074,1,0.0,0,<p>The Mann–Whitney–Wilcoxon test is the simpl...
74961,32036,7227,991,2664,143,113379,2,0.0,0,<p>Categorical variables have finite sets of d...
74962,32036,7227,991,2664,143,113527,2,0.0,7,<p>No. Age is a continuous variable and should...


# Bonus 2: (grouping) Group your dataframe by the Reputation of your user. Calculate the mean value of ViewCount and CommentCount for each reputation value.

Suppose the missing values in ViewCount are due to a systemic error: the system abended before recording them, and you want to guess what values should have been there in the first place.

Would the group mean be an interesting candidate for imputing the missing `ViewCount` values? If so, impute them with these values.

In [33]:
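# numeric_only=True averages only the numeric columns (the text column Body is skipped)
posts_from_users.groupby(by='Reputation', as_index=False).mean(numeric_only=True)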

Unnamed: 0,Reputation,user_id,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount
0,1,34805.628697,1.513910,231.484337,88.744797,79686.514786,-0.029573,96.014020,1.496824
1,2,22155.000000,6.250000,0.000000,0.000000,53110.000000,-1.000000,438.916667,2.333333
2,3,35432.953229,2.091314,0.002227,0.000000,81728.997773,-0.037862,158.273942,1.681514
3,4,30921.875000,3.851562,0.000000,0.000000,70314.242188,-0.101562,212.132812,2.335938
4,5,33835.269231,8.230769,0.000000,0.000000,78640.230769,-0.115385,297.076923,2.115385
...,...,...,...,...,...,...,...,...,...
960,31170,930.000000,5529.000000,10523.000000,214.000000,10013.558952,7.550218,40.500000,1.668122
961,37083,7290.000000,5554.000000,8641.000000,125.000000,57828.455865,4.290206,8.718259,1.866989
962,44152,686.000000,7357.000000,2156.000000,82.000000,56618.879850,2.569462,8.951189,2.013767
963,65272,805.000000,5680.000000,7035.000000,143.000000,78081.503488,3.453488,2.654070,2.287209


## References

Sample database used: https://relational.fit.cvut.cz/dataset/Stats

Stack-overflow database: https://www.brentozar.com/archive/2015/10/how-to-download-the-stack-overflow-database-via-bittorrent/
