# Project: Jupyter Notebook


## Keyboard Shortcuts I

In [None]:
welcome_message = 'Hello, Jupyter!'
first_cell = True

if first_cell:
    print(welcome_message)
    
print('First cell')

In [None]:
result = 1200 / 5
second_cell = True

if second_cell:
    print(result)
    
print('Second cell')

In [None]:
print('A true third cell')

## State

In [None]:
def welcome(a_string):
    print('Welcome to ' + a_string + '!')
    
dq = 'Dataquest'
jn = 'Jupyter Notebook'
py = 'Python'

In [None]:
welcome(dq)
welcome(jn)
welcome(py)

## Hidden State

In [None]:
%history -p

In [None]:
# Restart & Clear Output

In [None]:
'''
Note: To reproduce exactly the output in this notebook
as a whole:

1. Run all the cells above.
2. Restart the program's state but keep the output
(click Restart Kernel).
3. Then, run only the cells below.


(You were not asked in this exercise to write a note like this.
The note above was written to give more details on how to reproduce
the behavior seen in this notebook.)
'''

In [None]:
%history -p

In [None]:
def welcome(a_string):
    welcome_msg = 'Welcome to ' + a_string + '!'
    return welcome_msg

dq = 'Dataquest'
jn = 'Jupyter Notebook'

In [None]:
welcome(dq)
welcome(jn)
welcome(py)

In [None]:
%history -p

In [None]:
welcome(dq)
welcome(jn)
welcome(py)

## Markdown Syntax

In the code cell below, we:

- Open the `AppleStore.csv` file using the `open()` function, and assign the output to a variable named `opened_file`
- Import the `reader()` function from the `csv` module
- Read in the opened file using the `reader()` function, and assign the output to a variable named `read_file`
- Transform the read-in file to a list of lists using `list()` and save it to a variable named `apps_data`
- Display the header row and the first three rows of the data set.
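
The steps listed above can be sketched as follows. To keep the sketch runnable without the CSV file on disk, it reads from an in-memory string through `io.StringIO` instead of calling `open('AppleStore.csv')`. The three-column header and the two sample rows are invented for illustration and are not taken from the data set.

```python
from csv import reader
from io import StringIO

# Stand-in for open('AppleStore.csv', encoding='utf8'); the rows are invented.
opened_file = StringIO(
    'id,track_name,price\n'
    '1,Example App,0.0\n'
    '2,Another App,1.99\n'
)
read_file = reader(opened_file)   # iterator over parsed rows
apps_data = list(read_file)       # list of lists, header row first
opened_file.close()

print(apps_data[0])    # header row
print(apps_data[1:4])  # up to the first three data rows
```

Every value comes back as a string, so numeric columns such as the price need an explicit conversion (for example with `float()`) before any arithmetic.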

The data set above contains information about more than 7000 Apple iOS mobile apps. The data was collected from the iTunes Search API by data engineer [Ramanathan Perumal](https://www.kaggle.com/ramamet4). Documentation for the data set can be found [at this page](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home), where you'll also be able to download the data set.

This is a table explaining what each column in the data set describes:

Column name | Description
-- | --
"id" | App ID
"track_name" | App name
"size_bytes" | Size (in bytes)
"currency" | Currency type
"price" | Price amount
"rating_count_tot" | User rating count (for all versions)
"rating_count_ver" | User rating count (for the current version)
"user_rating" | Average user rating (for all versions)
"user_rating_ver" | Average user rating (for the current version)
"ver" | Latest version code
"cont_rating" | Content rating
"prime_genre" | Primary genre
"sup_devices.num" | Number of supported devices
"ipadSc_urls.num" | Number of screenshots shown for display
"lang.num" | Number of supported languages
"vpp_lic" | VPP device-based licensing enabled

In [None]:
from csv import reader

### The Google Play data set ###
# To prevent a UnicodeDecodeError, pass encoding="utf8" to open()
opened_file = open('googleplaystore.csv', encoding="utf8")
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

In [None]:
### The App Store data set ###
opened_file = open('AppleStore.csv', encoding="utf8")
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]

In [None]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [None]:
print(android_header)

In [None]:
explore_data(android, 0, 3, True)

- The Google Play data set has 10,841 apps and 13 columns.
- The most important columns include `'App'`, `'Category'`, `'Reviews'`, `'Installs'`, `'Type'`, `'Price'`, and `'Genres'`.

In [None]:
print(ios_header)

In [None]:
explore_data(ios, 0, 4, True)

- The iOS data set has 7,197 apps.
- Some of the interesting columns include `'track_name'`, `'currency'`, `'price'`, `'rating_count_tot'`, `'user_rating'`, and `'prime_genre'`.

- Details about each column can be found in the data set [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home)

## Deleting Wrong Data

- The Google Play data set has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion).
- One of the discussions outlines an error at row 10472.
- Let's print this row and compare it with the header and with a correct row.


In [None]:
print(android[10472]) # incorrect row

In [None]:
print(android_header) # header

In [None]:
print(android[0]) # correct row

- The row corresponds to the app _Life Made WI-Fi Touchscreen Photo Frame_, whose rating is listed as 19.
- That is not possible, because the maximum rating is 5.
- The error is caused by a missing value in the `'Category'` column, which shifts every later value one column to the left.
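
A minimal sketch of why a missing value produces this kind of error: when one field is absent, every later value shifts one position to the left, so the row ends up shorter than the header. The header and rows below are simplified, invented stand-ins and not the real data.

```python
header = ['App', 'Category', 'Rating', 'Reviews']

good_row = ['Some App', 'LIFESTYLE', '4.5', '120']
# The 'Category' value is missing, so '1.9' lands under 'Category'
# and '19' lands under 'Rating'.
bad_row = ['Broken App', '1.9', '19']

rows = [good_row, bad_row]

# enumerate() yields the index directly, which avoids a second scan
# of the list with list.index().
bad_indexes = [i for i, row in enumerate(rows) if len(row) != len(header)]
print(bad_indexes)

# Pair each header name with the value that ended up in its position.
shifted = dict(zip(header, bad_row))
print(shifted)
```

Comparing `len(row)` with `len(header)` only catches rows with the wrong number of fields; a wrong value in a row of the correct length would need a separate check.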


In [None]:
# The header was already removed from `android`, so iterate over every row.
for index, row in enumerate(android):
    if len(row) != len(android_header):
        print(row)
        print("\n")
        print("Index position is:", index)

In [None]:
print(len(android))
del android[10472]  # run this just once
print(len(android))

## Removing Duplicate Entries

### Part One
We can observe that some apps in the Google Play data have more than one entry.
For instance, the Instagram app has four entries:

In [None]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

In [None]:
duplicate_apps = []
unique_apps = []
for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print("Examples of duplicate apps: ", duplicate_apps[:15])  # Print first 15 duplicates


We don't want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. One thing we could do is remove the duplicate rows randomly, but we could probably find a better way.

If you examine the rows we printed two cells above for the Instagram app, the main difference happens on the fourth position of each row, which corresponds to the number of reviews. The different numbers show that the data was collected at different times. We can use this to build a criterion for keeping rows. We won't remove rows randomly, but rather we'll keep the rows that have the highest number of reviews because the higher the number of reviews, the more reliable the ratings.

To do that, we will:

- Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app
- Use the dictionary to create a new data set, which will have only one entry per app (and we only select the apps with the highest number of reviews)
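
A minimal sketch of the first step on a toy data set. The rows below are invented and keep only the app name and the review count (at index 0 and 1, unlike the real data set, where the review count sits at index 3). It uses `dict.get()` with `max()` as a compact alternative to an explicit `if`/`elif`.

```python
toy_rows = [
    ['Photo App', '100'],
    ['Photo App', '250'],
    ['Notes App', '40'],
    ['Photo App', '90'],
]

reviews_max = {}
for name, n_reviews in toy_rows:
    n_reviews = float(n_reviews)
    # dict.get() falls back to the current count when the name is new.
    reviews_max[name] = max(reviews_max.get(name, n_reviews), n_reviews)

print(reviews_max)

# Every row whose review count equals the stored maximum is a candidate
# for the cleaned data set.
candidates = [row for row in toy_rows if float(row[1]) == reviews_max[row[0]]]
print(candidates)
```

In this toy example each app has a single row with the maximum count, so the candidates already contain one entry per app. Ties between duplicate rows need extra handling.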

### Part Two

Let's start by building the dictionary.

In [None]:
reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
    


Earlier we found 1,181 cases where an app occurs more than once, so the length of our dictionary (of unique apps) should equal the length of our data set minus 1,181.

In [None]:
print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))

We start by initializing two empty lists, `android_clean` and `already_added`.
- We loop through the `android` data set, and for every iteration:
    - We isolate the name of the app and the number of reviews.
    - We add the current row (`app`) to the `android_clean` list, and the app name (`name`) to the `already_added` list if:
        - The number of reviews of the current app matches the number of reviews of that app as described in the `reviews_max` dictionary; and
        - The name of the app is not already in the `already_added` list. We need to add this supplementary condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we just check for `reviews_max[name] == n_reviews`, we'll still end up with duplicate entries for some apps.
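
A minimal sketch of why the second condition matters, using invented rows in which one app has three entries with the same review count. The names and numbers are made up. The sketch tracks names in a `set` rather than a list, which is a substitution for faster membership tests and does not change the result.

```python
rows = [
    ['Box', '500'],
    ['Box', '500'],
    ['Box', '500'],
    ['Maps', '40'],
]
reviews_max = {'Box': 500.0, 'Maps': 40.0}

# Only the review-count check: every tied row passes.
only_max_check = [r for r in rows if reviews_max[r[0]] == float(r[1])]
print(len(only_max_check))

# Both checks: the first tied row is kept, later ones are skipped.
clean = []
already_added = set()
for r in rows:
    name = r[0]
    if reviews_max[name] == float(r[1]) and name not in already_added:
        clean.append(r)
        already_added.add(name)
print(clean)
```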

In [None]:

android_clean = []
already_added = []
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if reviews_max[name] == n_reviews and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)




In [None]:
len(android_clean)

In [None]:
explore_data(android_clean, 0, 2, True)

We have 9659 rows, just as expected.

## Removing Non-English Apps

### Part One

If you explore the data sets enough, you'll notice the names of some of the apps suggest they are not directed toward an English-speaking audience.

In [None]:
print(ios[813][1])
print(ios[6731][1])

In [None]:
def is_english(word):
    """Return True if every character in `word` is ASCII (code point <= 127)."""
    for char in word:
        if ord(char) > 127:
            return False
    return True    


In [None]:
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

In [None]:
print(ord('™'))
print(ord('😜'))
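
Both code points printed above are greater than 127, so `is_english()` rejects any name that contains them, even when the rest of the name is plain English. Below is a minimal sketch of one possible relaxation. The function name `is_mostly_english`, the `max_non_ascii` parameter, and its default of 3 are hypothetical choices, not part of the original notebook, and the first two sample names are invented.

```python
def is_mostly_english(name, max_non_ascii=3):
    """Return True if at most `max_non_ascii` characters fall outside ASCII."""
    non_ascii = sum(1 for char in name if ord(char) > 127)
    return non_ascii <= max_non_ascii

print(is_mostly_english('Office Suite™'))
print(is_mostly_english('Chat 😜'))
print(is_mostly_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
```

The threshold is a heuristic: a short non-English name with three or fewer non-ASCII characters would still pass, so some manual inspection of the result remains useful.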