# NumPy

**Numeric Python** is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays.

The First Rule of NumPy

In [None]:
import numpy as np

The Second Rule of NumPy

In [None]:
# You don't need loops

first_arr = np.array([1, 2, 3, 4, 5])
second_arr = np.copy(first_arr)

# Instead of
for i in range(len(first_arr)):
    if first_arr[i] == 3 or first_arr[i] == 4:
        first_arr[i] = 0
# Do
second_arr[(second_arr == 4) | (second_arr == 3)] = 0

assert((first_arr == second_arr).all())
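
When there are many values to match, chaining `==` comparisons with `|` gets unwieldy. A minimal sketch of the same masking idea using `np.isin` and `np.where`; the names `arr` and `targets` are illustrative and not part of the cell above.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
targets = [3, 4]  # illustrative: the values to zero out

# np.isin returns a boolean mask that is True where arr equals any target
mask = np.isin(arr, targets)

# np.where builds a new array: 0 where the mask is True, the original value elsewhere
result = np.where(mask, 0, arr)

print(mask)    # [False False  True  True False]
print(result)  # [1 2 0 0 5]
```

Unlike the in-place assignment above, `np.where` leaves `arr` untouched and returns a new array.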

Array creation

In [None]:
a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 8])
print(type(a))
print(a.shape)
print(a.dtype)

Assigning and appending an element to an array

In [None]:
a[8] = 9
a = np.append(a, 10)
print(a)

Standard Python slicing syntax

In [None]:
print(a)
print('\n')

print(a[0:5])
print(a[0:5:2])
print(a[0:-1])
print(a[4::-1])
print(a[5:0:-2])

Built-in calculation of various statistics

In [None]:
print('Vector max %d, min %d, mean %.2f, median %.2f, standard deviation %.2f and total sum %d' %
      (a.max(), np.min(a), a.mean(), np.median(a), a.std(), a.sum()))

Filtering on condition (masking)

In [None]:
a[a > a.mean()]

Sorting

In [None]:
# Sorted array
print(np.sort(a))

# Order of indices in sorted array
print(np.argsort(a))
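
The indices returned by `np.argsort` are useful because they can reorder the original array, or any other array of the same length. A small sketch with made-up values; `values` and `names` are illustrative.

```python
import numpy as np

values = np.array([30, 10, 20])
names = np.array(["c", "a", "b"])  # illustrative labels aligned with values

order = np.argsort(values)  # positions that would sort values in ascending order

sorted_values = values[order]       # same result as np.sort(values)
sorted_names = names[order]         # reorder a second array the same way
top_two = values[order[::-1][:2]]   # two largest values, largest first

print(order)          # [1 2 0]
print(sorted_values)  # [10 20 30]
print(sorted_names)   # ['a' 'b' 'c']
print(top_two)        # [30 20]
```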

Vector operations

In [None]:
a = np.array([1, 2, 3])
b = np.array([2, 3, 4])
# Operations are applied element by element
print(a * b)
print(a - b)
print(a + b)

2D arrays (matrices)

In [None]:
m_a = np.array([[1, 2, 3, 4],
                [13, 3, 8, 2],
                [8, 7, 2, 3]])
print(m_a.shape)

Statistics calculation

In [None]:
print(m_a.max())
print(m_a.max(axis=0))
print(m_a.max(axis=1))
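
The `axis` argument names the dimension that is collapsed by the reduction. A short sketch using the same values as the matrix above; the variable `m` is illustrative.

```python
import numpy as np

m = np.array([[1, 2, 3, 4],
              [13, 3, 8, 2],
              [8, 7, 2, 3]])

col_max = m.max(axis=0)  # collapse the rows: one value per column
row_max = m.max(axis=1)  # collapse the columns: one value per row

print(m.shape)    # (3, 4)
print(col_max)    # [13  7  8  4]
print(row_max)    # [ 4 13  8]
```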

Sorting

In [None]:
print(np.sort(m_a, axis=0))
print(np.sort(m_a, axis=1))

# Intro to Pandas data structures

In [None]:
import pandas as pd

## Series

[Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

In [None]:
s = pd.Series(data=[39.4, 91.2, 80.5, 20.3, 4.2, -13.4],
              index=['first', 'second', 'second', 'third', 'fourth', 'fifth'])
print(type(s))
print(s.shape)
print(s.dtype)
print(s['second'])

In [None]:
s = pd.Series(data=[39.4, 91.2, 20.3, 4.2, -13.4])
s

Series acts very similarly to an [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html), and is a valid argument to most [NumPy](https://numpy.org/doc/stable/user/whatisnumpy.html) functions. However, operations such as slicing will also slice the index.

In [None]:
s[1:4]
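
Because the index is sliced along with the values, the first element of a slice keeps its original label rather than being renumbered from zero. A minimal sketch of the difference between positional access (`iloc`) and label access (`loc`); it uses `iloc` for the slice to make the positional intent explicit, and the name `part` is illustrative.

```python
import pandas as pd

s = pd.Series(data=[39.4, 91.2, 20.3, 4.2, -13.4])

part = s.iloc[1:4]  # positions 1, 2, 3; labels are carried along

print(part.index.tolist())  # [1, 2, 3]
print(part.iloc[0])         # 91.2, first element by position
print(part.loc[1])          # 91.2, same element by its original label
```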

Built-in statistics work the same way as in NumPy

In [None]:
print(np.max(s))
print(s.min())
print(s.std())

Vector operations

In [None]:
a = pd.Series([1, 2, 3])
b = pd.Series([2, 3, 4])

print(a * 2)
print(a + b)
print(a - b)
print(a * b)
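
One difference from NumPy is worth keeping in mind: arithmetic between two Series aligns values by index label, not by position. A small sketch with made-up labels; `x` and `y` are illustrative names.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=["a", "b", "c"])
y = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Labels present in only one Series produce NaN in the result
total = x + y

# The add method with fill_value treats a missing label as 0 instead
filled = x.add(y, fill_value=0)

print(total)
print(filled)
```

Here `total` has values only for `b` and `c`, while `filled` has a value for all four labels.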

## DataFrame

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.

DataFrame creation

In [None]:
d = {"one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
     "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
   }
df = pd.DataFrame(d)
df

In [None]:
d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}
df = pd.DataFrame(d)
df

### Basic transformations

Load an existing DataFrame from a CSV file. We'll use the Google Play Store Apps dataset from https://www.kaggle.com/lava18/google-play-store-apps.

In [None]:
data = pd.read_csv('googleplaystore.csv')
data.head(3)

Column types

In [None]:
data.dtypes

Column selection as DataFrame

In [None]:
data[['App', 'Rating']].head()

Column selection as Series. You can treat a DataFrame semantically like a dict of like-indexed Series objects.

In [None]:
data['Rating'].head()

In [None]:
type(data['Rating'])

In [None]:
data['Rating'].mean()

Row selection by position (integer index)

In [None]:
data.iloc[1]

Filtering (row selection by condition)

In [None]:
data[data['Rating'] < 3].head()

Assigning value to column based on condition

In [None]:
# Not correct: chained indexing assigns into a temporary copy,
# so `data` itself is not reliably updated
data[data['Rating'] < 3]['Rating'] = 0

In [None]:
data[data['Rating'] < 3].head(3)

In [None]:
# Correct
data.loc[data['Rating'] < 3, 'Rating'] = 0

In [None]:
data[data['Rating'] < 3].head(3)
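
The reason the chained form fails is that filtering with a boolean mask produces a separate object, so changes made to it never reach the original. A minimal sketch on a tiny made-up DataFrame; `ratings`, `low`, `before` and `after` are illustrative names, not part of the dataset.

```python
import pandas as pd

ratings = pd.DataFrame({"App": ["a", "b", "c"],
                        "Rating": [4.5, 2.0, 3.8]})

# Filtering returns a new object; changing it does not touch `ratings`
low = ratings[ratings["Rating"] < 3]
low = low.assign(Rating=0.0)
before = ratings["Rating"].tolist()

# .loc selects rows and column in one step and writes into `ratings` itself
ratings.loc[ratings["Rating"] < 3, "Rating"] = 0
after = ratings["Rating"].tolist()

print(before)  # [4.5, 2.0, 3.8]
print(after)   # [4.5, 0.0, 3.8]
```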

Sorting

In [None]:
data.sort_values(by='Rating').head(3)

Selecting values

In [None]:
data.head()['App'].values.tolist()

**All together**. List the top 5 free apps in the Education genre, ranked by rating, with ties broken alphabetically by app name

In [None]:
data[(data['Type'] == 'Free') & (data['Genres'] == 'Education')]\
    .sort_values(by=['Rating', 'App'], ascending=(False, True))\
    .head(5)['App'].values.tolist()

### Concatenating

Appending DataFrames

In [None]:
df1 = pd.DataFrame({
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],},
    index=[0, 1, 2, 3],)

df2 = pd.DataFrame({
    "A": ["A4", "A5", "A6", "A7"],
    "B": ["B4", "B5", "B6", "B7"],
    "C": ["C4", "C5", "C6", "C7"],
    "E": ["E4", "E5", "E6", "E7"],},
    index=[0, 1, 2, 3],)

In [None]:
df1

In [None]:
df2

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
pd.concat([df1, df2], ignore_index=True)

The **join** method is designed for combining DataFrames on their indices, which is what it does by default.

In [None]:
df3 = pd.DataFrame({
    "A": ["A1", "A2", "A3", "A4"],
    "F": ["F0", "F1", "F2", "F3"],},
    index=[0, 1, 2, 3],)

In [None]:
df1

In [None]:
df3

In [None]:
df1.join(df3, how='inner', lsuffix='_first', rsuffix='_third')

The **merge** method is more versatile: it lets us specify columns, not just the index, to join on for both DataFrames.

In [None]:
df1.merge(df3, how='left', on=['A'])
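
With a left merge, rows from the left DataFrame that have no match on the right are kept and the missing columns are filled with `NaN`. A small sketch that makes this visible with the `indicator` parameter of `merge`; `left`, `right` and `merged` are illustrative names with values shaped like `df1` and `df3`.

```python
import pandas as pd

left = pd.DataFrame({"A": ["A0", "A1", "A2", "A3"],
                     "B": ["B0", "B1", "B2", "B3"]})
right = pd.DataFrame({"A": ["A1", "A2", "A3", "A4"],
                      "F": ["F0", "F1", "F2", "F3"]})

# indicator=True adds a _merge column saying where each row came from
merged = left.merge(right, how="left", on="A", indicator=True)

print(merged)
```

The row with `A0` is marked `left_only` and has `NaN` in column `F`; `A4` exists only on the right, so it does not appear in a left merge.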

### Grouping

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
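
The split, apply and combine steps can be sketched on a tiny made-up DataFrame before trying them on the real dataset. The frame `apps` and its values are illustrative only.

```python
import pandas as pd

apps = pd.DataFrame({"Category": ["GAME", "GAME", "TOOLS", "TOOLS", "ART"],
                     "Rating": [4.0, 5.0, 3.0, 4.0, 4.8]})

# Split by Category, apply mean to each group's Rating, combine into one Series
means = apps.groupby("Category")["Rating"].mean()

# idxmax returns the index label (here, the category) of the largest value
best = means.idxmax()

# agg applies several functions at once and combines them into a DataFrame
summary = apps.groupby("Category")["Rating"].agg(["mean", "count"])

print(means)
print(best)  # ART
print(summary)
```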

In [None]:
data.head(3)

Find the _Category_ with the highest average rating among its applications. No loops, I promise.

In [None]:
data.groupby('Category')['Rating'].mean().sort_values(ascending=False).index[0]