# Manipulating Data with NumPy

In [None]:
y = [10,7,8,9]

In [None]:
x = [20, 40, 60, 80]
x  +  y

[20, 40, 60, 80, 10, 7, 8, 9]

## 1. Introduction to NumPy:

In [None]:
import numpy as np

In [None]:
# Example 1:

lst = [10, 200, 90, 0]
lst

[10, 200, 90, 0]

In [None]:
x = np.array(lst)
x * 20

array([ 200, 4000, 1800,    0])

NumPy supports vectorized operations such as element-wise addition, subtraction and multiplication, which basic Python lists do not: with lists, `+` concatenates instead of adding.
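
As a small sketch of this difference (the variable names below are illustrative only), the same `+` operator joins two lists but adds two arrays element by element:

```python
import numpy as np

prices_list = [20, 40, 60, 80]
extra_list = [10, 7, 8, 9]

# With plain lists, + joins the two lists end to end
joined = prices_list + extra_list

# With NumPy arrays, + adds matching elements
summed = np.array(prices_list) + np.array(extra_list)

print(joined)   # [20, 40, 60, 80, 10, 7, 8, 9]
print(summed)   # [30 47 68 89]
```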

In [None]:
# Example 2:

lst = [22, 45, 78, 50]


arr = np.array(lst)
print(arr)
print(arr.nbytes)

[22 45 78 50]
32


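A Python list stores a pointer for each element as well as the full object each pointer refers to. A NumPy array, however, stores values of one uniform type, so every element takes the same fixed number of bytes (here 4 elements of 8 bytes each, 32 bytes in total).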
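
A rough way to see this is to compare the sizes reported for a list and for an array. This is only a sketch: the exact numbers depend on the platform and Python version, so no expected output is shown.

```python
import sys
import numpy as np

lst = [22, 45, 78, 50]
arr = np.array(lst)

# Bytes in the array's data buffer: number of elements * bytes per element
print(arr.nbytes, arr.size * arr.itemsize)

# Bytes used by the list object itself (its table of pointers),
# followed by the bytes used by the int objects it points to
print(sys.getsizeof(lst), sum(sys.getsizeof(v) for v in lst))
```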

## 2. NumPy Object Creation:



![](https://i.imgur.com/mg8O3kd.png)


## 1. Array Creation:
`np.array` takes a basic Python list (or a nested list) as an argument, optionally along with an explicit datatype (for typecasting), and returns a NumPy ndarray, as shown below:

In [None]:
x = np.array([3,4,5,6])
x

array([3, 4, 5, 6])

In [None]:
y = np.array([[1,2,3,4],[4,6,7,8]])
y

array([[1, 2, 3, 4],
       [4, 6, 7, 8]])

In [None]:

a = np.array([1,2,3,4])               # creates a 1-dimensional array
b = np.array([[1,2,3,4], [5,6,7,8.0]])    # creates a 2-dimensional array
print(a)
print('----')
print(b)

[1 2 3 4]
----
[[1. 2. 3. 4.]
 [5. 6. 7. 8.]]
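
The explicit datatype mentioned earlier can be passed through the `dtype` argument of `np.array`. A minimal sketch (the variable names are illustrative):

```python
import numpy as np

# With dtype, the values are cast to the requested type
as_float = np.array([1, 2, 3, 4], dtype=np.float64)
as_int = np.array([1.7, 2.2, 3.9], dtype=np.int32)  # fractional part is dropped

print(as_float)        # [1. 2. 3. 4.]
print(as_int)          # [1 2 3]
print(as_float.dtype)  # float64
```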


## Useful Functions 

#### Shape :

In [None]:
a.shape

(4,)

In [None]:


print('The shape of the array a is ', a.shape)
print('The shape of the array b is ', b.shape)

The shape of the array a is  (4,)
The shape of the array b is  (2, 4)


#### Dimensions:

In [None]:
print('The dimensions of array a is ', a.ndim)
print('The dimensions of array b is ', b.ndim)

The dimensions of array a is  1
The dimensions of array b is  2


#### Size:

In [None]:
print('The size of the array a is ', a.size)
print('The size of the array b is ', b.size)

The size of the array a is  4
The size of the array b is  8


#### Datatype:

In [None]:
print('The datatype of the array a is ', a.dtype)
print('The datatype of the array b is ', b.dtype)

The datatype of the array a is  int64
The datatype of the array b is  float64


#### Itemsize:

In [None]:
print('The number of bytes in each element of the array a is  ', a.itemsize)
print('The number of bytes in each element of the array b is ', b.itemsize)

The number of bytes in each element of the array a is   8
The number of bytes in each element of the array b is  8


## 2. Matrix Creation:

Moving ahead let's learn creation of a matrix using NumPy. There are three methods:

**Method 1**: Using NumPy array to form a matrix.

**Method 2**: Using NumPy's inbuilt matrix function.

**Method 3**: Using miscellaneous functions such as zeros(), ones(), etc.

- **Method 1**: Using array and reshape to convert array into matrix

In [None]:
a = np.array([6,7,8,9,3,4,8,5,7]).reshape(3,3)
a

array([[6, 7, 8],
       [9, 3, 4],
       [8, 5, 7]])

In [None]:
print(np.array([5,6,8,45,12,52]).reshape(2,3))


- **Method 2**: Using matrix function


In [None]:
np.matrix([[4,5,6],[7,8,9],[6,5,4]])

matrix([[4, 5, 6],
        [7, 8, 9],
        [6, 5, 4]])

In [None]:
print(np.matrix([[1,2],[3,4]]))


- **Method 3**: Using misc. functions


In [None]:
print(np.eye(4)) # Identity matrix


[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]


In [None]:
print( np.zeros( (4,3) ) )


[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]


In [None]:
print(np.ones( (4,5), np.float64))


[[1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]]


- **Useful Functions**:


In [None]:
np.arange(20,30, dtype='float64')

array([20., 21., 22., 23., 24., 25., 26., 27., 28., 29.])

In [None]:
# NumPy array from 1 to 19
print(np.arange(1,20,dtype='int32'))

In [None]:
# NumPy array from 1 to 19 with step size 2
print(np.arange(1,20,2,dtype='int8'))

[ 1  3  5  7  9 11 13 15 17 19]


#### Practice Exercise:

## 3. Indexing and Slicing:

### Indexing:

We now know how to create different types of NumPy arrays and check their features. But how about accessing a particular value or taking a chunk of values from the array itself? In this topic, we are going to discuss exactly that.

Like Python lists, array indices start at 0, and slicing follows the same array[start:stop:step] pattern. Let us look at an example to observe this behaviour.

In [None]:
x = 'python'
x[0]

'p'

In [None]:
a = np.array([9,8,7,6,5])
a[2]

7
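
The start:stop:step pattern can be sketched on a 1-dimensional array as follows (the name `values` is just for illustration):

```python
import numpy as np

values = np.array([9, 8, 7, 6, 5])

first_slice = values[1:4]    # indices 1, 2, 3 (stop is excluded)
every_second = values[::2]   # every second element
reversed_vals = values[::-1] # a negative step walks backwards

print(first_slice)    # [8 7 6]
print(every_second)   # [9 7 5]
print(reversed_vals)  # [5 6 7 8 9]
```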

In [None]:
a = np.array([[1,2,3],[4,5,6],[7,8,9],[9,7,4],[0,8,7]])
a

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9],
       [9, 7, 4],
       [0, 8, 7]])

In [None]:
a[2][2]

9

In [None]:
a[2:, 2:]

array([[9],
       [4],
       [7]])

In [None]:
a[0:3, 1:]

array([[2, 3],
       [5, 6],
       [8, 9]])

In [None]:
import numpy as np

In [None]:
a = np.array([[1,2,3],[4,5,6],[7,8,9],[8,6,5],[9,0,9]])
a

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9],
       [8, 6, 5],
       [9, 0, 9]])

In [None]:
a[2][0]

7

In [None]:
a[0:,1:2]

array([[2],
       [5],
       [8],
       [6],
       [0]])

In [None]:
a = np.array([[1,2,3],[4,5,6],[7,8,9]])

# Pull out second element of third row
print(a[2][1])
print('==========')
# Pull out first two rows and columns
print(a[:2,:2])
print('==========')
# Pull all elements of the third row
print(a[2,:])

### Integer Array Indexing:

Integer array indexing allows you to construct arbitrary arrays using the data from another array. You pass one list of row indices and one list of column indices, and NumPy pairs them up position by position.

An example of integer array indexing:

In [None]:
a=np.array([[1,2],[3,4],[5,6]])

print(a[[0,1,2],[0,1,0]])
print('==========')

print(np.array([a[0,0],a[1,1],a[2,0]]))
print('==========')

print(a[[0,0],[1,1]])
print('==========')

print(np.array([a[0,1],a[0,1]]))

In [None]:
a = np.array([[1,2,3],[4,5,6],[7,8,9],[8,6,5],[9,0,9]])
a

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9],
       [8, 6, 5],
       [9, 0, 9]])

In [None]:
a[[2],[1]]

array([8])

In [None]:
print(np.array([a[0,0],a[1,1],a[2,2]]))

[1 5 9]


Explanation: In the four-print example above, print statements 1 & 2 yield the same result, as do statements 3 & 4. In the first case, a[[0, 1, 2], [0, 1, 0]] indexes the values at first row-first column, second row-second column and third row-first column, which is the same as [a[0, 0], a[1, 1], a[2, 0]]. Similarly, a[[0, 0], [1, 1]] picks the first row-second column value twice, which is the same as [a[0, 1], a[0, 1]].
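
The pairing of row and column indices can also be written out by hand. A minimal sketch, where `grid`, `rows` and `cols` are names chosen for illustration:

```python
import numpy as np

grid = np.array([[1, 2], [3, 4], [5, 6]])
rows = [0, 1, 2]
cols = [0, 1, 0]

# Integer array indexing pairs rows[i] with cols[i]
picked = grid[rows, cols]

# The same selection written one (row, column) pair at a time
manual = np.array([grid[r, c] for r, c in zip(rows, cols)])

print(picked)  # [1 4 5]
print(manual)  # [1 4 5]
```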

### Boolean Indexing:
Boolean indexing selects the elements of an array that satisfy a condition. A boolean index array has the same shape as the array to be filtered and contains only True and False values; indexing with it keeps the elements where the value is True.

Let's look at an example:

In [None]:
a = np.array([[4,7,1],[2,5,7],[7,1,1]])

# Boolean condition for values greater than 3
mask = a > 3
print(mask)

# Masking for the above boolean condition in the array
print(a[mask])

In [None]:
a = np.array([[4,7,1],[2,5,7],[7,1,1]])
a

array([[4, 7, 1],
       [2, 5, 7],
       [7, 1, 1]])

In [None]:
a > 3

array([[ True,  True, False],
       [False,  True,  True],
       [ True, False, False]])
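
A short sketch of what a boolean mask does when it is used as an index, reusing the same array (the name `selected` is illustrative):

```python
import numpy as np

a = np.array([[4, 7, 1], [2, 5, 7], [7, 1, 1]])
mask = a > 3

selected = a[mask]

print(mask.shape == a.shape)  # True: the mask has the same shape as the array
print(selected)               # [4 7 5 7 7]: matching values, flattened to 1-D
print(mask.sum())             # 5: True counts as 1, so this counts the matches
```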

#### Practice Exercise:

- Create an array with the values 3, 4.5, 3 + 5j and 0. Check whether each value is real or complex.

Hint: Use the functions np.isreal(array) and np.iscomplex(array)

## 4. Vectorization:

Vectorization is NumPy's ability to apply an operation to an entire array at once rather than to one element at a time in an explicit loop.




In [None]:
#Example:

import numpy as np
arr = np.array([1,2,3,4,5,6,7])
print(arr[arr > 2])

The output of the above code block will be [3 4 5 6 7]: arr > 2 compares every element with 2, and the resulting boolean mask keeps only the elements greater than 2.


In [None]:
arr = np.array([1,2,3,4,5,6,7])
arr

array([1, 2, 3, 4, 5, 6, 7])

In [None]:
arr[arr > 4]

array([5, 6, 7])
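
To make the contrast with element-by-element processing concrete, here is a small sketch that computes the same result with an explicit loop and with a vectorized expression (the variable names are illustrative):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

# Element-by-element version using an explicit Python loop
doubled_loop = np.array([value * 2 for value in arr])

# Vectorized version: one expression applied to the whole array
doubled_vec = arr * 2

print(doubled_vec)                                # [ 2  4  6  8 10 12 14]
print(np.array_equal(doubled_loop, doubled_vec))  # True
```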

**Vectorized Operations**:



In [None]:
# Creating two arrays for operations

a = np.array([[1,2,3],[4,5,6],[7,8,9]])
b = np.array([[10,11,12],[13,14,15],[16,17,18]])


In [None]:
a

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [None]:
b

array([[10, 11, 12],
       [13, 14, 15],
       [16, 17, 18]])

In [None]:
a + b

array([[11, 13, 15],
       [17, 19, 21],
       [23, 25, 27]])

Addition :

In [None]:
print(a+b)


Subtraction :

In [None]:
print(a-b)

Multiplication :

In [None]:
print(a*b)

[[ 10  22  36]
 [ 52  70  90]
 [112 136 162]]


Division :

In [None]:
print(a/b)

[[0.1        0.18181818 0.25      ]
 [0.30769231 0.35714286 0.4       ]
 [0.4375     0.47058824 0.5       ]]


Square Root Transformation : 

In [None]:
a = np.array([[1,4,9],[16,25,36]])
a

array([[ 1,  4,  9],
       [16, 25, 36]])

In [None]:
np.sqrt(a)

array([[1., 2., 3.],
       [4., 5., 6.]])

In [None]:
np.sqrt(625)

25.0

In [None]:
a = np.array([[1,4,9],[16,25,36]])
print(np.sqrt(a))

In [None]:
a = [3,7,7,8,9,3]

In [None]:
np.mean(a)

6.166666666666667

In [None]:
a = 7 
b = 8

np.add(a,b)

15

**These operations can also be written as function calls**:

1. np.add(a,b)
2. np.subtract(a,b)
3. np.multiply(a,b)
4. np.divide(a,b)
5. np.sqrt(a)
6. np.log(a)
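
A quick sketch showing that the function forms match the operators; `left` and `right` are illustrative names:

```python
import numpy as np

left = np.array([1, 4, 9])
right = np.array([2, 2, 3])

print(np.add(left, right))       # same as left + right
print(np.subtract(left, right))  # same as left - right
print(np.multiply(left, right))  # same as left * right
print(np.divide(left, right))    # same as left / right
print(np.sqrt(left))             # element-wise square root
print(np.log(left))              # element-wise natural logarithm
```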

In [None]:
x = [5,8,9,4,3,9]

In [None]:
# x in sorted order (handy for checking the median by hand)
3, 4, 5, 8, 9, 9

In [None]:
np.sum(x)

38

In [None]:
np.cumsum(x)

array([ 5, 13, 22, 26, 29, 38])

In [None]:
np.mean(x)

6.333333333333333

In [None]:
np.median(x)

6.5

**Practice usage of following aggregate functions**:

1. a.sum()    Array-wise sum
2. a.min()    Array-wise minimum value
3. a.max(axis=0)    Maximum value of each column
4. a.cumsum(axis=1)    Cumulative sum of the elements along each row
5. a.mean()    Mean
6. np.median(a)    Median
7. np.corrcoef(a)    Correlation coefficients
8. np.std(a)    Standard deviation
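
The aggregate functions listed above can be tried on a small 2-D array. This is a sketch on made-up data, with `m` as an illustrative name:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m.sum())           # 21
print(m.min())           # 1
print(m.max(axis=0))     # [4 5 6]   maximum of each column
print(m.cumsum(axis=1))  # running total along each row
print(m.mean())          # 3.5
print(np.median(m))      # 3.5
print(np.std(m))         # standard deviation of all six values
print(np.corrcoef(m))    # 2x2 matrix: correlation between the two rows
```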




In [None]:
a = [3,7,7,8,9,3]   # the list from the np.mean example; a was reassigned to 7 after it
np.std(a)

2.3392781412697

### Array Comparison and Understanding Axes Notation:

**Array Comparison**:

An entire array can be compared to another array using the function np.array_equal.


In [None]:
# Example:

a = np.array([5,6,7,8])
b = np.array([5,6,7,8])
print(np.array_equal(a,b))

It returns True only if both arrays have the same shape and the same elements.
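
For contrast, the `==` operator compares element by element and returns an array, while np.array_equal returns a single True or False. A small sketch with illustrative names:

```python
import numpy as np

first = np.array([5, 6, 7, 8])
second = np.array([5, 6, 7, 9])

elementwise = first == second

print(elementwise)                                 # [ True  True  True False]
print(np.array_equal(first, second))               # False
print(np.array_equal(first, first.copy()))         # True
print(np.array_equal(first, first.reshape(2, 2)))  # False: shapes differ
```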


### Axes Notation:

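An axis refers to a particular dimension of a multidimensional array. The axis argument determines whether an operation is performed down the rows (axis=0, one result per column), across the columns (axis=1, one result per row) or on the whole array (no axis given).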

Let's see an example:

In [None]:
a = np.array([[2,5,7],[4,25,30],[7,55,3]])
a

array([[ 2,  5,  7],
       [ 4, 25, 30],
       [ 7, 55,  3]])

In [None]:
a.sum()

138

In [None]:
a.sum(axis=1)

array([14, 59, 65])

In [None]:
a.sum(axis=0)

array([13, 85, 40])

In [None]:
a.min(axis=0)

array([2, 5, 3])

In [None]:
a = np.array([[2,5,7],[4,25,30]])

# sum of each column (adds down the rows)
print(a.sum(axis=0))
print('==========')

# sum of each row (adds across the columns)
print(a.sum(axis=1))
print('==========')
print('==========')

# computes total sum
print(a.sum())
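
One way to remember the axis argument is that the axis you name is the one that disappears from the result's shape. A minimal sketch (`col_sums` and `row_sums` are illustrative names):

```python
import numpy as np

a = np.array([[2, 5, 7], [4, 25, 30]])   # shape (2, 3)

col_sums = a.sum(axis=0)   # collapses axis 0 (rows): one value per column
row_sums = a.sum(axis=1)   # collapses axis 1 (columns): one value per row

print(col_sums, col_sums.shape)  # [ 6 30 37] (3,)
print(row_sums, row_sums.shape)  # [14 59] (2,)
```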

### Practice Exercise:

We have a buy-sell problem:

Initialize an array [[40, 35, 20], [21, 48, 70]] which holds the prices on 2 consecutive days at 3 different sessions of each day. The objective is to buy at the minimum price on day 1 and sell at the maximum price on day 2.

1. Find the minimum price on day 1.
2. Find the maximum price on day 2.
3. Calculate the profit and print it.
4. Find the index of the session on which buying and selling took place.

### Hint:

1. Use array.min() and array.max() to find the minimum and maximum prices.
2. Select the prices for each day using array[0] and array[1]
3. Index = list(array).index(value)

## Step 1 : Read the Data! 

<Please add the file>

In [None]:
data_file ='makeSenseOfCensus.csv'

### Loading the data

In [None]:
data = np.genfromtxt(data_file, delimiter=",", skip_header=1)

# printing the data

print("\nData: \n\n", data)
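
To see what np.genfromtxt returns without needing the real file, here is a sketch that reads a tiny, made-up CSV from memory. The column names and values are hypothetical and are not taken from the census file.

```python
import io
import numpy as np

# Hypothetical two-row CSV standing in for the real file
csv_text = "age,education-num\n39,13\n50,9\n"

# skip_header=1 drops the header line; values are parsed as floats by default
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(sample)
print(sample.shape)  # (2, 2)
print(type(sample))  # <class 'numpy.ndarray'>
```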

#### Printing the type of data

In [None]:
print("\nType of data: \n\n", type(data))

## Step 2 : Append the Data

#### Append 'new_record' (given) to 'data' using "np.concatenate()" and store the result in 'census'


In [None]:
new_record=[[50, 9, 4, 1, 0, 0, 40, 0]]
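
As a sketch of how np.concatenate appends rows, here is the same operation on small made-up arrays. `existing` and `extra_row` are hypothetical stand-ins, not the census data.

```python
import numpy as np

# Hypothetical stand-ins for 'data' and 'new_record'
existing = np.array([[39, 13], [50, 9]])
extra_row = [[28, 10]]

# Both inputs must have the same number of columns; axis=0 stacks rows
combined = np.concatenate((existing, extra_row), axis=0)

print(combined.shape)  # (3, 2)
print(combined[-1])    # [28 10]
```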

## Step 3 : Check if it's a young country or old country

- Create a new array called 'age' by taking only age column(age is the column with index 0) of 'census' array.

- Find the max age and store it in a variable called 'max_age'.

- Find the min age and store it in a variable called 'min_age'.

- Find the mean of the age and store it in a variable called 'age_mean'.

- Find the standard deviation of the age and store it in a variable called 'age_std'
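
The steps above rely on selecting one column and aggregating it. A minimal sketch on made-up records (`records` and `ages` are hypothetical names, and the values are invented):

```python
import numpy as np

# Hypothetical records: column 0 is treated as age
records = np.array([[39, 13], [50, 9], [28, 10], [61, 12]])

ages = records[:, 0]   # every row, column 0

print(ages.max())      # 61
print(ages.min())      # 28
print(ages.mean())     # 44.5
print(ages.std())      # standard deviation of the four ages
```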

## Step 4: Let's check the country's race distribution to identify the minorities



- Create five different arrays by subsetting the 'census' array by the Race column (Race is the column with index 2) and save them in 'race_0', 'race_1', 'race_2', 'race_3' and 'race_4' respectively (meaning: store the rows where the 'race' column has value 0 in 'race_0', and so on).

- Store the length of the above created arrays in 'len_0', 'len_1', 'len_2', 'len_3' and 'len_4' respectively.

- Find out which is the race with the minimum no. of citizens.

- Store the number associated with the minority race in a variable called 'minority_race' (for example, if "len(race_3)" is the minimum, store 3 in 'minority_race' because that is the index of the race having the least no. of citizens).
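
Subsetting rows by the value in one column uses a boolean mask built from that column. The sketch below uses made-up data with the category code in column 1; every name and value in it is hypothetical.

```python
import numpy as np

# Hypothetical records: column 1 is treated as a category code (0, 1 or 2)
records = np.array([[39, 0], [50, 1], [28, 0], [61, 2]])

# A boolean mask built from one column selects whole rows
group_0 = records[records[:, 1] == 0]
print(len(group_0))   # 2

# Count the rows for each code, then find the code with the fewest rows
counts = [len(records[records[:, 1] == code]) for code in range(3)]
smallest = int(np.argmin(counts))

print(counts)    # [2, 1, 1]
print(smallest)  # 1 (argmin returns the first minimum when there is a tie)
```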

## Step 5: As per govt. records citizens above 60 should not work more than 25 hours a week. Let us check if the policy is in place



- Create a new subset array called 'senior_citizens' by filtering 'census' according to age>60 (age is the column with index 0)

- Add all the working hours(working hours is the column with index 6) of 'senior_citizens' and store it in a variable called 'working_hours_sum'

- Find the length of 'senior_citizens' and store it in a variable called 'senior_citizens_len'

- Finally, find the average working hours of the senior citizens by dividing 'working_hours_sum' by 'senior_citizens_len' and store it in a variable called 'avg_working_hours'.

- Print 'avg_working_hours' and see if the govt. policy is followed.

## Step 6: Let's check that higher educated people have better pay in general.



- Create two new subset arrays called 'high' and 'low' by filtering 'census' according to education-num>10 and education-num<=10 (education-num is the column with index 1) respectively.

- Find the mean of the income column (income is the column with index 7) of the 'high' array and store it in 'avg_pay_high'. Do the same for the 'low' array and store its mean in 'avg_pay_low'.
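
Filtering on a threshold in one column and then averaging another column can be sketched as follows. The data and column positions are invented for illustration and do not match the census file.

```python
import numpy as np

# Hypothetical records: column 1 is a score, column 2 is a 0/1 outcome
records = np.array([[39, 13, 1], [50, 9, 0], [28, 10, 0], [61, 12, 1]])

high = records[records[:, 1] > 10]    # rows whose score is above 10
low = records[records[:, 1] <= 10]    # the remaining rows

avg_high = high[:, 2].mean()
avg_low = low[:, 2].mean()

print(avg_high)  # 1.0
print(avg_low)   # 0.0
```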



### Project :

Let us build a complete project using NumPy (without any help).

#### Path: project_data = 'KAG_Conversion_Data.csv'

#### Features:

1. ad_id:    unique ID for each ad
2. xyzcampaignid:    an ID associated with each ad campaign of XYZ company
3. fbcampaignid:    an ID associated with how Facebook tracks each campaign
4. age:    age of the person to whom the ad is shown
5. gender:    gender of the person to whom the ad is shown
6. interest:    a code specifying the category to which the person’s interest belongs (interests are as mentioned in the person’s Facebook public profile)
7. Impressions:    the number of times the ad was shown
8. Clicks:    number of clicks on that ad
9. Spent:    Amount paid by company xyz to Facebook, to show that ad
10. Total conversion:    Total number of people who enquired about the product after seeing the ad
11. Approved conversion:    Total number of people who bought the product after seeing the ad

#### Instructions:

- Load the data. The file path is already given to you in the variable project_data.

- How many unique ad campaigns (xyzcampaignid) does this data contain? And how many times was each campaign run?

- What are the age groups that were targeted through these ad campaigns?

- What was the average, minimum and maximum amount spent on the ads?

- What is the id of the ad having the maximum number of clicks?

- How many people bought the product after seeing the ad with most clicks? Is that the maximum number of purchases in this dataset?

- So the ad with the most clicks didn't fetch the maximum number of purchases. Find the details of the ad having the maximum number of purchases.
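
Two NumPy functions that are likely to help with these questions are np.unique (with return_counts=True) and np.argmax. The sketch below uses invented IDs and click counts rather than the project file, and is only one possible approach.

```python
import numpy as np

# Hypothetical campaign IDs and click counts, one entry per ad
campaign_ids = np.array([916, 936, 936, 1178, 916, 936])
clicks = np.array([1, 4, 0, 9, 2, 3])

# Unique values and how many times each one appears
ids, runs = np.unique(campaign_ids, return_counts=True)
print(ids)    # [ 916  936 1178]
print(runs)   # [2 3 1]

# Position of the entry with the most clicks
top = int(np.argmax(clicks))
print(top)    # 3
```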