# Data Wrangling in Python  
*Using Itertools with the __MovieLens__ dataset*  

**Part 2: Playing with Itertools**  
  
![Playing with Itertools](./../images/data_munging_00-Python-Collections-02.png)

### <font color='green'>__Support for Google Colab__  </font>  
    
Open this notebook in Colab using the following button:  
  
<a href="https://colab.research.google.com/github/shauryashaurya/learn-data-munging/blob/main/00-Python-Collections/01.02%20Playing%20with%20Itertools.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>  

  
<font color='green'>Uncomment and execute the cell below to set up and run this notebook on Google Colab.</font>

In [1]:
# # SETUP FOR COLAB: select all the lines below and uncomment (Ctrl+/ on Windows)
# # Let's download and unzip the Small MovieLens Dataset
# ! mkdir -p ./../data
# ! wget -q https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
# ! unzip ./ml-latest-small.zip -d ./../data/

### Get the _Small_ MovieLens Dataset

We'll use the [small MovieLens dataset](https://grouplens.org/datasets/movielens/#:~:text=Small%3A%20100%2C000%20ratings%20and%203%2C600%20tag%20applications) here.

Download it and unzip it into the `data` folder, so that the files end up in `data/ml-latest-small/`.

This dataset expands to about 3.2 MB on your local disk. 
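
Before going further, it can help to confirm that the unzip step put the files where the notebook expects them. The following is a minimal sketch, not part of the original notebook: the helper name `missing_dataset_files` and the constant `EXPECTED_FILES` are made up for illustration, and the file names are the four CSV files this notebook reads.

```python
from pathlib import Path

# The four CSV files this notebook reads (names as used in the cells below).
EXPECTED_FILES = ("movies.csv", "links.csv", "ratings.csv", "tags.csv")


def missing_dataset_files(folder):
    """Return the expected CSV names that are not present in `folder`."""
    root = Path(folder)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Hypothetical check against the folder used in this notebook.
missing = missing_dataset_files("./../data/ml-latest-small/")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All MovieLens files found.")
```

If anything is reported missing, re-check the unzip target folder before running the cells that follow.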

In [2]:
datalocation = "./../data/ml-latest-small/"

In [3]:
# specify file names
file_path_movies = datalocation + "movies.csv"
file_path_links = datalocation + "links.csv"
file_path_ratings = datalocation + "ratings.csv"
file_path_tags = datalocation + "tags.csv"

In [4]:
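import itertools
from itertools import permutations
import io
import time

# Default buffer size, in bytes, that open() uses for buffered I/O
print(io.DEFAULT_BUFFER_SIZE)

# List every name the itertools module exposes
dir(itertools)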

8192


['__doc__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_grouper',
 '_tee',
 '_tee_dataobject',
 'accumulate',
 'chain',
 'combinations',
 'combinations_with_replacement',
 'compress',
 'count',
 'cycle',
 'dropwhile',
 'filterfalse',
 'groupby',
 'islice',
 'pairwise',
 'permutations',
 'product',
 'repeat',
 'starmap',
 'takewhile',
 'tee',
 'zip_longest']
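
To give a feel for a few of the names in that listing, here is a small sketch using `islice`, `chain`, and `accumulate`. The `rows` list is made-up sample data shaped like lines of `movies.csv`, not real records from the dataset.

```python
from itertools import accumulate, chain, islice

# Made-up sample rows, shaped like lines of movies.csv (not real data).
rows = [
    "movieId,title,genres",
    "1,Movie A,Comedy",
    "2,Movie B,Drama",
    "3,Movie C,Comedy",
]

# islice: lazily take items 1 and 2, which skips the header row.
first_two = list(islice(rows, 1, 3))

# chain: treat several iterables as one continuous stream.
combined = list(chain(["header"], first_two))

# accumulate: running total of the row lengths.
running_lengths = list(accumulate(len(row) for row in rows))

print(first_two)
print(combined)
print(running_lengths)
```

All three return lazy iterators, so wrapping them in `list()` is only needed here to display the results.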

In [5]:
%%timeit
# Count lines with a 100 MB buffer; %%timeit handles the timing
count = 0
with open(file_path_movies, 'r', buffering=100000000, encoding='utf-8') as f:
    for line in f:
        count += 1

2.45 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [6]:
%%timeit
# Count lines with a 1 MB buffer; %%timeit handles the timing
count = 0
with open(file_path_movies, 'r', buffering=1000000, encoding='utf-8') as f:
    for line in f:
        count += 1

2.32 ms ± 182 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [7]:
%%timeit
# Count lines with a 1,000-byte buffer; %%timeit handles the timing
count = 0
with open(file_path_movies, 'r', buffering=1000, encoding='utf-8') as f:
    for line in f:
        count += 1

2.97 ms ± 331 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [8]:
%%timeit
# buffering=-1 asks for the default buffering policy; %%timeit handles the timing
count = 0
with open(file_path_movies, 'r', buffering=-1, encoding='utf-8') as f:
    for line in f:
        count += 1

3.58 ms ± 535 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [12]:
%%timeit
# No buffering argument: same default policy as buffering=-1
count = 0
with open(file_path_movies, 'r', encoding='utf-8') as f:
    for line in f:
        count += 1

2.04 ms ± 40.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
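
The timing cells above differ only in the `buffering` argument, so the comparison can be folded into one function. This is a rough sketch rather than the notebook's own method: `count_lines` and `time_buffer_sizes` are hypothetical helper names, and the standard-library `timeit` module stands in for the `%%timeit` magic so the code also runs outside IPython.

```python
import timeit


def count_lines(path, buffering=-1):
    """Count the lines in a text file opened with the given `buffering` value."""
    # The `with` block closes the file even if iteration raises.
    with open(path, "r", buffering=buffering, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def time_buffer_sizes(path, sizes=(-1, 1000, 1_000_000), repeats=20):
    """Return {buffering: mean seconds per call} for each value in `sizes`."""
    return {
        size: timeit.timeit(lambda: count_lines(path, size), number=repeats) / repeats
        for size in sizes
    }
```

Calling `time_buffer_sizes(file_path_movies)` would give one mean timing per buffer size. For a file this small the numbers are likely to sit within noise of each other, and they will vary with the machine and the operating system's file cache.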


In [42]:
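%%timeit

from collections import Counter
# Histogram of line lengths (each length includes the trailing newline).
# Names assigned inside a %%timeit cell are not kept afterwards,
# so run this without %%timeit to inspect `count`.
with open(file_path_movies, 'r', encoding='utf-8', buffering=1000) as f:
    line_lengths = [len(line) for line in f]
count = Counter(line_lengths)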

2.31 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [40]:
count

Counter({43: 327,
         44: 321,
         38: 312,
         39: 300,
         42: 296,
         45: 295,
         40: 293,
         46: 287,
         41: 284,
         47: 284,
         50: 282,
         48: 275,
         49: 262,
         37: 258,
         51: 253,
         33: 228,
         35: 226,
         36: 217,
         53: 213,
         52: 208,
         34: 201,
         54: 192,
         56: 189,
         55: 187,
         32: 181,
         31: 180,
         57: 165,
         30: 152,
         58: 147,
         60: 145,
         59: 140,
         29: 133,
         62: 120,
         61: 120,
         63: 108,
         64: 106,
         66: 96,
         65: 89,
         28: 87,
         67: 84,
         69: 83,
         68: 73,
         27: 73,
         73: 68,
         26: 66,
         71: 66,
         70: 65,
         72: 64,
         74: 63,
         76: 61,
         77: 57,
         75: 52,
         25: 50,
         24: 41,
         82: 37,
         78: 34,
         81:
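
A `Counter` this long is easier to read once summarized. The sketch below shows a few ways to do that; `line_length_counts` is a hypothetical helper and `sample` is made-up input, so the numbers it produces say nothing about `movies.csv`.

```python
from collections import Counter


def line_length_counts(lines):
    """Map each line length (newline included) to how often it occurs."""
    return Counter(len(line) for line in lines)


# Made-up lines for illustration; not taken from movies.csv.
sample = ["ab\n", "cd\n", "efgh\n", "ij\n"]
counts = line_length_counts(sample)

top = counts.most_common(1)                   # most frequent length first
shortest, longest = min(counts), max(counts)  # the keys are the lengths
total_lines = sum(counts.values())            # the values are the frequencies

print(top, shortest, longest, total_lines)
```

The same calls work on the `count` object built from `movies.csv`, for example `count.most_common(5)` for the five most frequent line lengths.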

# Next

We look at `itertools` and `functools`.