***What is NumPy?***

NumPy is an open-source library for working efficiently with arrays. Developed in 2005 by Travis Oliphant, its name stands for Numerical Python. It is a critical data science library in Python, and many other libraries depend on it.

(https://www.learndatasci.com/tutorials/applied-introduction-to-numpy-python-tutorial/)

***Some of NumPy's advantages:***

1. Mathematical operations on NumPy’s ndarray objects are up to 50x faster than iterating over native Python lists using loops. The efficiency gains come primarily from NumPy storing array elements in one contiguous block of memory, requiring all elements to be the same type, and running its operations in compiled code that makes full use of modern CPUs. The efficiency advantages become particularly apparent when operating on arrays with thousands or millions of elements, which are pretty standard within data science.

2. It offers an indexing syntax for easily accessing portions of data within an array.

3. It contains built-in functions that improve quality of life when working with arrays and math, such as functions for linear algebra, array transformations, and matrix math.

4. It requires fewer lines of code for most mathematical operations than native Python lists.
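
To make the first and last points concrete, here is a minimal sketch that squares a million numbers with a Python loop and with a single NumPy expression. The names `py_list`, `np_arr`, and the two helper functions are illustrative only, and the measured speed-up depends on your machine, so no particular ratio is claimed.

```python
import timeit

import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)  # explicit 64-bit ints to avoid overflow


def square_with_loop():
    # Element-by-element loop over a native Python list
    return [x * x for x in py_list]


def square_with_numpy():
    # One vectorized expression; the looping happens inside NumPy
    return np_arr * np_arr


loop_time = timeit.timeit(square_with_loop, number=3)
numpy_time = timeit.timeit(square_with_numpy, number=3)
print(f"loop:  {loop_time:.4f} s")
print(f"numpy: {numpy_time:.4f} s")
```

Both functions produce the same values; only the way the work is expressed differs.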

***What's the relationship between NumPy, SciPy, Scikit-learn, and Pandas?***

<b>NumPy</b> provides a foundation on which other data science packages are built, including SciPy, Scikit-learn, and Pandas.

<b>SciPy</b> provides a menu of libraries for scientific computations. It extends NumPy by including integration, interpolation, signal processing, more linear algebra functions, descriptive and inferential statistics, numerical optimizations, and more.

<b>Scikit-learn</b> extends NumPy and SciPy with advanced machine-learning algorithms.

<b>Pandas</b> extends NumPy by providing functions for exploratory data analysis, statistics, and data visualization. It can be thought of as Python's equivalent to Microsoft Excel spreadsheets for working with and exploring tabular data.

The NumPy array - an n-dimensional data structure - is the central object of the NumPy package.


A one-dimensional NumPy array can be thought of as a <b><font color='red'>vector</font></b>, a two-dimensional array as a <b><font color='red'>matrix</font></b> (i.e., a set of vectors), and a three-dimensional array as a <b><font color='red'>tensor</font></b> (i.e., a set of matrices).

e.g.
Vector -> np.array([1,2])

Matrix -> np.array([[1,2],[3,4]])

Tensor -> np.array([[[1,2],[3,4]],
                    [[5,6],[7,8]],
                    [[9,10],[11,12]]])


***Array data types***

An array can consist of integers, floating-point numbers, or strings. Within an array, the data type must be consistent (e.g., all integers or all floats).

Need an array with mixed data types? Consider using NumPy's structured (record) arrays or pandas DataFrames instead (see the Pandas tutorial).
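
As a rough illustration of the single-type rule, the sketch below shows NumPy upcasting integers to floats when both appear in one list, followed by a small structured array that holds mixed types per field. The field names and the sample records are made up for this example.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed_numbers = np.array([1, 2.5, 3])   # the integers are upcast to floats

print(ints.dtype.kind)            # 'i' for integer
print(mixed_numbers.dtype.kind)   # 'f' for floating point

# A structured array: each field has its own data type.
# The field names and values here are hypothetical.
players = np.array(
    [("Ann", 30, 1.75), ("Bob", 25, 1.80)],
    dtype=[("name", "U10"), ("age", "i4"), ("height", "f8")],
)
print(players["age"])    # only the integer field
print(players["name"])   # only the string field
```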


In this article, we'll restrict our focus to conventional NumPy arrays consisting of a single data type.

<font color = 'yellow'>Using np.array()</font>

To define an array manually, we can use the np.array() function. Below, we pass a list of two elements, each of which is a list containing two values. The result is a 2x2 matrix:

In [2]:
import numpy as np
import pandas as pd


In [3]:
np.array([[1,3],[4,5]])

array([[1, 3],
       [4, 5]])

NumPy has numerous functions for generating commonly-used arrays without having to enter the elements manually. A few of those are shown below:

<font color ='yellow'>Defining arrays: np.arange()</font>
    
The function np.arange() is great for creating vectors easily. Here, we create a vector with values spanning 1 up to (but not including) 5:

In [5]:
np.arange(1,5)

array([1, 2, 3, 4])

<b> Defining arrays: np.zeros, np.ones, np.full</b>
    
In many programming tasks, it can be useful to initialize a variable and then write a value to it later in the code. If that variable happens to be a NumPy array, a common approach would be to create it as an array with zeros in every element. We can do this using <b>np.zeros().</b> Here, we create two arrays of zeros: the first with one row and three columns, the second with three rows and one column.

In [5]:
np.zeros([1,3])

array([[0., 0., 0.]])

In [6]:
np.zeros([3,1])

array([[0.],
       [0.],
       [0.]])
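
The initialize-then-fill pattern described above might look like the following sketch. The array name `squares` and the loop body are placeholders chosen for illustration.

```python
import numpy as np

n_steps = 5
squares = np.zeros(n_steps)   # placeholder values, float64 by default

for i in range(n_steps):
    squares[i] = i ** 2       # overwrite each placeholder in place

print(squares)
```

Allocating the array once and writing into it avoids growing a Python list element by element and converting it afterwards.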

<b>np.full()</b> creates an array filled with a fixed value that you supply as the second argument. Here we create a 3x3 array with the number 7 in each element:


In [7]:
np.full((3,3),7)

array([[7, 7, 7],
       [7, 7, 7],
       [7, 7, 7]])

<b>np.ones()</b> works like np.zeros() but fills every element with one:

In [9]:
np.ones([1,3])

array([[1., 1., 1.]])

In [10]:
np.ones([3,3])

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

<b>Array shape</b>

All arrays have a shape accessible using <i>.shape</i>

For example, let's get the shape of a vector, matrix, and tensor.

In [18]:
vector = np.arange(5)
print("Vector shape:",vector.shape)
print(vector)

matrix = np.ones([3,2])
print("Matrix shape:",matrix.shape)
print(matrix)
tensor = np.zeros([2,3,3])
print("Tensor shape:",tensor.shape)
print(tensor)

Vector shape: (5,)
[0 1 2 3 4]
Matrix shape: (3, 2)
[[1. 1.]
 [1. 1.]
 [1. 1.]]
Tensor shape: (2, 3, 3)
[[[0. 0. 0.]
  [0. 0. 0.]
  [0. 0. 0.]]

 [[0. 0. 0.]
  [0. 0. 0.]
  [0. 0. 0.]]]


The vector is one-dimensional, so its shape holds a single number: the number of elements. For the matrix, .shape tells us we have three rows and two columns. The tensor adds a leading axis: the first number is how many matrices (slices) we have, the second is the number of rows in each, and the third is the number of columns.

If you're familiar with pandas, you might have noticed that the syntax for the number of rows and columns is strikingly similar to the equivalent in pandas. As we continue to explore NumPy arrays, you may notice many more similarities.
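
Since a shape is an ordinary Python tuple, it can be unpacked directly. The short sketch below also uses the related `ndim` and `size` attributes; the variable names are illustrative.

```python
import numpy as np

matrix = np.ones([3, 2])
n_rows, n_cols = matrix.shape   # unpack the shape tuple
print(n_rows, n_cols)

tensor = np.zeros([2, 3, 3])
print(tensor.ndim)   # number of axes
print(tensor.size)   # total number of elements (2 * 3 * 3)
```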

<b>NumPy arange()</b> is one of the array creation routines based on numerical ranges. It creates an ndarray with evenly spaced values and returns it.

The signature is arange([start,] stop[, step][, dtype]). It returns an array of evenly spaced elements within the given interval, which is half-open, i.e. [start, stop).

In [20]:
np.arange(start=1,stop=10,step=3)

array([1, 4, 7])

In this example, start is 1, so the first element of the array is 1. step is 3, so the second value is 1+3, that is 4, and the third is 4+3, which equals 7. The next value would be 10, which equals stop and is therefore excluded.

In [21]:
#you can pass start, stop, and step as positional arguments as well:
np.arange(1,10,3)

array([1, 4, 7])
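
A few more calls, sketched below, show how the defaults and the half-open interval behave. The specific arguments are arbitrary examples.

```python
import numpy as np

only_stop = np.arange(5)                 # start defaults to 0, step to 1
stop_excluded = np.arange(1, 7, 3)       # 7 is excluded, so the values are 1 and 4
counting_down = np.arange(10, 0, -3)     # a negative step counts down: 10, 7, 4, 1
as_floats = np.arange(0, 3, dtype=float) # dtype controls the element type

print(only_stop)
print(stop_excluded)
print(counting_down)
print(as_floats)
```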

<b>Reshaping arrays</b>

We can reshape an array into any compatible dimensions using <i>.reshape</i>

In [3]:
arr = np.arange(1,10)
print(arr,'\n')

[1 2 3 4 5 6 7 8 9] 



In [6]:
#reshape
arr = arr.reshape(3,3)
print(arr,'\n')

[[1 2 3]
 [4 5 6]
 [7 8 9]] 



In [7]:
# Reshape back to the original size
arr = arr.reshape(9)
print(arr,'\n')

[1 2 3 4 5 6 7 8 9] 



NumPy can infer one of the dimensions if you pass -1 for it. The total number of elements must still divide evenly by the dimensions you do specify for the inference to work.

In [23]:
arr = np.arange(1,10).reshape(3,-1)
arr

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
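
The sketch below shows the inference from the other side and what happens when the element count does not fit; the shapes are chosen only for illustration.

```python
import numpy as np

arr = np.arange(1, 10)            # 9 elements

inferred = arr.reshape(-1, 3)     # NumPy works out that 3 rows are needed
print(inferred.shape)

try:
    arr.reshape(4, -1)            # 9 elements cannot be split into 4 equal rows
except ValueError as err:
    print("reshape failed:", err)
```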

<b>Reading data from a file into an array</b>

Usually, data sets are too large to define manually. Instead, the most common use case is to import data from a data file into a NumPy array.

In [8]:
import csv

In [9]:
data = []
with open(r'C:\Users\srini\Downloads\nba3.csv', newline='') as csvfile:
    file_reader = csv.reader(csvfile, delimiter=',')
    for row in file_reader:
        data.append(row)
data = np.array(data)  # convert the list of lists to a NumPy array
data.shape

(458, 5)

In [33]:
data.dtype.type

numpy.str_

We now have our data stored in a NumPy array that we've named data. For much of the remainder of this article, we'll be exploring how NumPy's functionality can be used to manipulate and gain insights into this data.
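
Because csv.reader yields every field as text, the resulting array holds strings even where the file contains numbers. The following self-contained sketch uses a small in-memory CSV (hypothetical data standing in for a file on disk) to show the same loading pattern and one way to convert a numeric column with `astype`.

```python
import csv
import io

import numpy as np

# Hypothetical stand-in for a CSV file on disk
csv_text = "name,points\nAnn,12.5\nBob,7.0\nCid,20.25\n"

with io.StringIO(csv_text) as csvfile:
    rows = list(csv.reader(csvfile, delimiter=","))

table = np.array(rows)                # every element is a string
header, body = table[0], table[1:]    # split off the header row
points = body[:, 1].astype(float)     # convert one column to floats

print(table.dtype.kind)   # 'U' for Unicode strings
print(points.mean())
```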

<b>Saving</b>

When we are ready to save our data, we can use the np.save() function, which writes the array to a binary .npy file.

In [35]:
np.save(r'C:\Users\srini\Downloads\new_nba_numpy.npy', data)
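
A saved array can be read back with np.load(). The sketch below does a round trip through a temporary directory so it runs anywhere; the file name and the sample array are placeholders.

```python
import os
import tempfile

import numpy as np

sample = np.array([["a", "1"], ["b", "2"]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.npy")   # hypothetical file name
    np.save(path, sample)        # write the array in .npy format
    restored = np.load(path)     # read it back into memory

print(np.array_equal(sample, restored))
print(restored.dtype == sample.dtype)
```

The .npy format stores the shape and data type alongside the values, so the restored array matches the original without any re-parsing.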

<b>Indexing</b>

At some point, it will become necessary to index (select) subsets of a NumPy array. For instance, you might want to plot one column of data or perform a manipulation of that column. NumPy's indexing notation is similar to MATLAB's, except that NumPy uses square brackets, indices start at 0, and the stop value of a slice is excluded.

Commas separate axes of an array.

Colons define a slice from a start index up to, but not including, a stop index. For example, x[0:4] means the first 4 rows (rows 0 through 3) of x.

Negative numbers mean "from the end of the array." For example, x[-1] means the last row of x.

Blanks before or after colons mean "the rest of". For example, x[3:] means row 3 and all the rows after it. Similarly, x[:3] means all the rows up to, but not including, row 3. x[:] means all rows of x.

When there are fewer indices than axes, the missing indices are considered complete slices. For example, in a 3-axis array, x[0,0] means all data in the 3rd axis of the 1st row and 1st column.

Dots "..." mean as many colons as needed to produce a complete indexing tuple. For example, if x has five axes, x[1,2,...] is the same as x[1,2,:,:,:].
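
These rules can be checked on small arrays, as in the sketch below. The arrays `x` and `t` are made up for the demonstration; note that a slice's stop index is excluded.

```python
import numpy as np

x = np.arange(20).reshape(5, 4)       # 5 rows, 4 columns

first_four = x[0:4]    # rows 0, 1, 2, 3 (the stop index 4 is excluded)
last_row = x[-1]       # negative index counts from the end
from_three = x[3:]     # row 3 and everything after it
up_to_three = x[:3]    # rows 0, 1, 2
column_one = x[:, 1]   # the comma separates axes: all rows, column 1

t = np.arange(24).reshape(2, 3, 4)    # a 3-axis array
missing_axis = t[0, 0]    # fewer indices than axes: same as t[0, 0, :]
with_dots = t[1, ...]     # "..." expands to t[1, :, :]

print(first_four.shape, from_three.shape, up_to_three.shape)
print(last_row, column_one)
print(missing_axis, with_dots.shape)
```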

In [11]:
data = []
with open(r"C:\Users\srini\Downloads\MER_T07_02A.csv", newline='') as csvfile:
    file_reader = csv.reader(csvfile, delimiter=',')
    for row in file_reader:
        data.append(row)

data1 = np.array(data)

In [5]:
data1

array([['MSN', 'YYYYMM', 'Value', 'Column_Order', 'Description', 'Unit'],
       ['CLETPUS', '194913', '135451.32', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours'],
       ['CLETPUS', '195013', '154519.994', '1',
        'Electricity Net Generation From Coal, All Sectors',
        'Million Kilowatthours'],
       ...,
       ['ELETPUS', '202305', '327858.999', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours'],
       ['ELETPUS', '202306', '357140.257', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours'],
       ['ELETPUS', '202307', '425654.556', '13',
        'Electricity Net Generation Total (including from sources not shown), All Sectors',
        'Million Kilowatthours']], dtype='<U80')

In [12]:
data1.shape

(8854, 6)

Another property of a NumPy array that we may wish to know is its data type. This information is stored in the dtype attribute. Inspecting dtype.type reveals that our array is made up of strings:



In [7]:
data1.dtype.type

numpy.str_

In [8]:
np.save(r"C:\Users\srini\Downloads\MER_T07_02A.npy", data1)

In [10]:
data1[0:10,4]

array(['Description', 'Electricity Net Generation From Coal, All Sectors',
       'Electricity Net Generation From Coal, All Sectors',
       'Electricity Net Generation From Coal, All Sectors',
       'Electricity Net Generation From Coal, All Sectors',
       'Electricity Net Generation From Coal, All Sectors',
       'Electricity Net Generation From Coal, All Sectors',
       'Electricity Net Generation From Coal, All Sectors',
       'Electricity Net Generation From Coal, All Sectors',
       'Electricity Net Generation From Coal, All Sectors'], dtype='<U80')

Indexing example 1: Colons as *all* rows or columns

A colon can also denote all rows, or all columns. Here, we index all rows of column 4.

In [11]:
data1[:,4]

array(['Description', 'Electricity Net Generation From Coal, All Sectors',
       'Electricity Net Generation From Coal, All Sectors', ...,
       'Electricity Net Generation Total (including from sources not shown), All Sectors',
       'Electricity Net Generation Total (including from sources not shown), All Sectors',
       'Electricity Net Generation Total (including from sources not shown), All Sectors'],
      dtype='<U80')

We can use the same format for any dimension of an array. The general syntax is: array[start_row:end_row, start_col:end_col]. The following indexes all rows and the columns from index 2 up to (but not including) index 4:

In [12]:
data1[:,2:4]

array([['Value', 'Column_Order'],
       ['135451.32', '1'],
       ['154519.994', '1'],
       ...,
       ['327858.999', '13'],
       ['357140.257', '13'],
       ['425654.556', '13']], dtype='<U80')

Explicitly specifying column numbers

What if the columns we need are not next to each other? Instead of indexing a range of columns, it can be useful to specify them explicitly. To explicitly specify particular columns, we just include them in a list. Let's index the five rows after the header, selecting only columns 2 and 3. This time, we'll write the output to a new array named sub that we can re-use in the following example.

In [13]:
sub = data1[1:6,[2,3]]
sub

array([['135451.32', '1'],
       ['154519.994', '1'],
       ['185203.657', '1'],
       ['195436.666', '1'],
       ['218846.325', '1']], dtype='<U80')

Mask arrays

Another convenient way to index certain sections of a NumPy array is to use a mask array. A mask array, also known as a logical array, contains boolean elements (i.e. True or False). Indexing of a given array element is determined by the value of the mask array's corresponding element.

First, we define a NumPy array of True/False values, where the True values are the ones we want to keep. Then we use it to mask the sub array from the previous example. The result retains only the rows that correspond to elements that are True in the mask array.

In [14]:
mask = np.array([False,True,False,True,True])
sub[mask]

array([['154519.994', '1'],
       ['195436.666', '1'],
       ['218846.325', '1']], dtype='<U80')

As you can see, the mask array retained the <font color ="green">rows corresponding to True</font> and <font color ="red">excluded the ones corresponding to False.</font> It is worth noting that a similar approach is used for indexing pandas DataFrames.

Masking is a powerful tool that allows us to index elements based on logical expressions. We'll make good use of it in the case study later in the article.
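
As a small sketch of masking with a logical expression, the example below builds the mask from a comparison instead of typing it by hand. The numbers are the five values from the earlier output, entered as floats; the thresholds are arbitrary.

```python
import numpy as np

values = np.array([135451.32, 154519.994, 185203.657, 195436.666, 218846.325])

mask = values > 180_000      # element-wise comparison yields a boolean array
print(mask)
print(values[mask])          # keep only the elements where mask is True

# Conditions combine with & (and), | (or), and ~ (not);
# each condition needs its own parentheses.
in_range = (values > 150_000) & (values < 200_000)
print(values[in_range])
```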

<b>Concatenating</b>

NumPy also provides useful functions for concatenating (i.e., joining) arrays. Let's say we wanted to restrict our attention to the first and the last three rows of our dataset. First, we'll define new sub-arrays as follows: