# Data Reading

<p>This notebook shows how to read the ratings data, keep only users with at least 5 reviews, and split the result into a training set and a test set.</p>

In [1]:
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

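# Parameters
minNumberOfReviews = 5    # minimum number of reviews a user needs to be kept
percentTrainingData = 80  # percentage of rows assigned to the training set
seed = 42                 # random seed, so the split is reproducible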

### Read Data
<p>First we read all the data into a pandas DataFrame. The file has no header row, so we supply the column names ourselves.</p>

In [9]:
fileData = pd.read_csv('data/ratings_Books.csv', names=["userId", "itemId", "rating", "time"])
fileData

Unnamed: 0,userId,itemId,rating,time
0,AH2L9G3DQHHAJ,0000000116,4.0,1019865600
1,A2IIIDRK3PRRZY,0000000116,1.0,1395619200
2,A1TADCM7YWPQ8M,0000000868,4.0,1031702400
3,AWGH7V0BDOJKB,0000013714,4.0,1383177600
4,A3UTQPQPM4TQO0,0000013714,5.0,1374883200
5,A8ZS0I5L5V31B,0000013714,5.0,1393632000
6,ACNGUPJ3A3TM9,0000013714,4.0,1386028800
7,A3BED5QFJWK88M,0000013714,4.0,1350345600
8,A2SUAM1J3GNN3B,0000013714,5.0,1252800000
9,APOZ15IEYQRRR,0000013714,5.0,1362787200
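
The item IDs above carry leading zeros, which pandas would drop if it inferred an all-numeric column as integers. The sketch below is one way to guard against that; it is not part of the original notebook. It uses a small in-memory sample (an assumption, standing in for `data/ratings_Books.csv`) and passes an explicit `dtype` for the ID columns.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same header-less layout as the ratings file
sample = io.StringIO(
    "AH2L9G3DQHHAJ,0000000116,4.0,1019865600\n"
    "A2IIIDRK3PRRZY,0000000116,1.0,1395619200\n"
    "A1TADCM7YWPQ8M,0000000868,4.0,1031702400\n"
)

columns = ["userId", "itemId", "rating", "time"]

# Passing names= tells pandas the file has no header row.
# Reading the ID columns as strings keeps the leading zeros intact.
frame = pd.read_csv(sample, names=columns, dtype={"userId": str, "itemId": str})

print(frame)
```

Without the `dtype` argument, the `itemId` values in this sample would be parsed as integers such as `116`.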


### Filter out users with fewer than 5 reviews
<p>We group the reviews by user, count the reviews per user, and keep only the users with at least 5 reviews.</p>

In [3]:
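# Group by a scalar key so that get_group() can later be called with a plain user id
groupedData = fileData.groupby('userId')
# Number of reviews per user (a Series indexed by userId)
sortedData = groupedData.size()
# Keep only the users that meet the minimum review count
sortedData = sortedData[sortedData >= minNumberOfReviews]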

<p>We then remove all reviews whose user is not in the filtered list.</p>

In [4]:
userIdsToKeep = sortedData.index
sortedData = fileData[fileData['userId'].isin(userIdsToKeep)]
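
The filtering step can be checked on a small example. The sketch below uses a hypothetical toy frame and a threshold of 2 instead of 5 (both are assumptions for illustration). It repeats the two-step approach from the cells above and compares it with a one-step variant based on `transform("size")`.

```python
import pandas as pd

# Hypothetical toy ratings: user "a" has 3 reviews, "b" has 1, "c" has 2
toy = pd.DataFrame({
    "userId": ["a", "a", "a", "b", "c", "c"],
    "itemId": ["i1", "i2", "i3", "i1", "i2", "i3"],
    "rating": [5.0, 4.0, 3.0, 2.0, 5.0, 1.0],
})
min_reviews = 2  # stand-in for minNumberOfReviews

# Two-step approach: count reviews per user, then keep rows of qualifying users
counts = toy.groupby("userId").size()
keep = counts[counts >= min_reviews].index
filtered = toy[toy["userId"].isin(keep)]

# One-step variant: broadcast each user's review count back onto the rows
per_row_count = toy.groupby("userId")["userId"].transform("size")
filtered_alt = toy[per_row_count >= min_reviews]

print(filtered)
```

Both variants keep the original row index, which the split further down relies on.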

### Split data into training and test sets
<p>We now split the filtered data into a training set (data) and a test set (testData). Rows are sampled at random, and the test set is every row that was not sampled.</p>

In [6]:
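# Number of rows in the training set (float division, so this also works under Python 2)
elements = int(round(percentTrainingData / 100.0 * sortedData.shape[0]))
# Draw the training rows at random; the seed makes the draw reproducible
data = sortedData.sample(elements, random_state=seed)
# The test set is every row that was not drawn (relies on a unique row index)
testData = sortedData.drop(data.index)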

data

Unnamed: 0,userId,itemId,rating,time
22127907,A2XJG3XJ1XY6H2,B00I6ZXOHM,4.0,1394496000
22078635,A2ALBZAFLT4LN0,B00HVQ5S1W,5.0,1393286400
1008711,ARHLQBP7EH28W,006230240X,5.0,1359072000
22003959,A1QTIE77Y5WG5N,B00HEXWHMU,5.0,1397433600
9485250,A3F78GMKHPZ9UA,0802406289,3.0,1372118400
1853116,A2DG1H013DN01P,0156004801,5.0,1377302400
22024369,A3PTX9PMX4QGF4,B00HIIGU8S,5.0,1396828800
15546971,A2FQF8C8IGXKC2,1492910759,4.0,1392940800
4055208,A17C9PWK1XD9KL,0373772955,5.0,1368835200
8050976,ADGR3XB512CM1,0740761811,5.0,1370390400


In [7]:
testData

Unnamed: 0,userId,itemId,rating,time
22,A23PISU0ZLW71C,0000029831,5.0,1393200000
59,A28X5I7TL8BAOH,0000913154,5.0,1319328000
82,A2WVHIRDMLM82E,000100039X,5.0,1394928000
85,A19N3FCQCLJYUA,000100039X,5.0,1358899200
101,A27ZH1AQORJ1L,000100039X,5.0,1066003200
102,A5E9TSD20U9PR,000100039X,5.0,1377475200
128,A1NPNGWBVD9AK3,000100039X,5.0,961804800
132,A13C50JF143I83,000100039X,5.0,1396742400
141,A15ACUAJEJXCS3,000100039X,5.0,957312000
149,A3NPACKJFKAZIE,000100039X,5.0,1403049600
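
A few sanity checks make the properties of this split explicit. The sketch below runs them on a hypothetical toy frame (an assumption for illustration), using the same `sample` and `drop` pattern. Because rows are sampled without regard to the user, a user can end up with fewer reviews in the training set than in the filtered data, so the last line prints the per-user counts for inspection.

```python
import pandas as pd

# Hypothetical toy frame: two users with five reviews each
toy = pd.DataFrame({"userId": list("aaaaabbbbb"), "rating": range(10)})
percent_train = 80  # stand-in for percentTrainingData

n_train = int(round(percent_train / 100.0 * len(toy)))
train = toy.sample(n_train, random_state=42)
test = toy.drop(train.index)

# The two sets together cover every row exactly once
assert len(train) + len(test) == len(toy)
assert train.index.intersection(test.index).empty

# Reviews per user that ended up in the training set
print(train.groupby("userId").size())
```

If every user must keep a minimum number of training reviews, a per-user split would be needed instead; that is outside what the cells above do.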


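### Summary of data frames
<p>We now have the following objects to look things up in:</p>
* fileData - the raw ratings as read from the CSV file
* groupedData - the raw ratings grouped by user (includes users with fewer than 5 reviews)
* sortedData - the filtered ratings (only users with at least 5 reviews), covering both training and test rows; despite the name, it is not sorted
* data - training data
* testData - test data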



<p>Below we show a few useful methods for accessing the data</p>

<b>Find a user's reviews</b>

In [8]:
groupedData.get_group('ACR4T36M4FSW')

Unnamed: 0,userId,itemId,rating,time
890170,ACR4T36M4FSW,0062107763,5.0,1399420800
2095812,ACR4T36M4FSW,0263852547,5.0,1396137600
2098970,ACR4T36M4FSW,0263877957,5.0,1382140800
2099859,ACR4T36M4FSW,0263899233,5.0,1384041600
2103464,ACR4T36M4FSW,0263907546,4.0,1403827200
2103594,ACR4T36M4FSW,0263912566,4.0,1396137600
3822774,ACR4T36M4FSW,0345528891,5.0,1388448000
3956651,ACR4T36M4FSW,0373066627,5.0,1390608000
3972362,ACR4T36M4FSW,037318073X,5.0,1387670400
4031438,ACR4T36M4FSW,0373657684,5.0,1385164800
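
Since `groupedData` is built from the unfiltered `fileData`, the lookup above also returns users with fewer than 5 reviews. To look up a user in one of the other frames, a boolean mask is a simple alternative. The sketch below compares the two on a hypothetical toy frame (an assumption for illustration).

```python
import pandas as pd

# Hypothetical toy ratings
toy = pd.DataFrame({
    "userId": ["a", "b", "a", "c"],
    "itemId": ["i1", "i1", "i2", "i3"],
    "rating": [5.0, 3.0, 4.0, 2.0],
})

# Lookup through a groupby object, as in the cell above
via_group = toy.groupby("userId").get_group("a")

# Lookup with a boolean mask; works on any frame that has a userId column
via_mask = toy[toy["userId"] == "a"]

print(via_mask)
```

The same mask applied to `data` or `testData` would return only the user's training or test reviews.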
