# Python Tutorial : Day 2
Here is what we are going to do today: **Summarizing Data**
1. [df.info()](#1)
2. [df.shape](#2)
3. [df.size](#3)
4. [df.ndim](#4)
5. [df.index](#5)
6. [df.columns](#6)
7. [df.count()](#7)
8. [df.sum()](#8)
9. [df.cumsum()](#9)
10. [df.min()](#10)
11. [df.max()](#11)
12. [df.idxmin()](#12)
13. [df.idxmax()](#13)
14. [df.describe()](#14)
15. [df.mean()](#15)
16. [df.median()](#16)
17. [df.quantile()](#17)
18. [df.var()](#18)
19. [df.std()](#19)
20. [df.cummax()](#20)
21. [df.cummin()](#21)
22. [df['columnName'].cumprod()](#22)
23. [len(df)](#23)
24. [df.isnull()](#24)
25. [df.corr()](#25)

Let's get started!

[Daily news for stock market prediction](https://www.kaggle.com/aaron7sun/stocknews)

When we come across a new dataset, the first thing we do is analyze it. pandas offers many built-in functions and attributes that help summarize a dataset, and I have chosen some of the most important ones. Let's go through them one by one.

In [None]:
# import library
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# import data
df = pd.read_csv("/kaggle/input/stocknews/upload_DJIA_table.csv")

In [None]:
# looking at the top five rows of the data
df.head()

In [None]:
# looking at the bottom five rows of the data
df.tail()

## Summarize Data
It is easy to get information about the data using pandas. Let's examine the built-in functions one by one:

### 1) **df.info()**<a id="1"></a>
This method prints a concise summary of the data. It contains:
   * RangeIndex : Specifies the number of entries in the dataset
   * Data Columns : Specifies the total number of columns
   * Columns : Gives information about each column (name and non-null count)
   * dtypes : Specifies the datatype of each column
   * Memory Usage : Describes memory usage

In [None]:
df.info()

### 2) **df.shape**<a id="2"></a>
Returns a tuple (ROWS, COLUMNS) describing the shape of the DataFrame. For a Series, the tuple has a single element: (ROWS,).

In [None]:
df.shape

### 3) **df.size**<a id="3"></a>
Returns the total number of elements in the DataFrame/Series. For a DataFrame, that is rows x columns.

In [None]:
df.size

### 4) **df.ndim**<a id="4"></a>
Returns the number of dimensions of the DataFrame/Series: 1 for one dimension (Series), 2 for two dimensions (DataFrame).

In [None]:
# dataframe
df.ndim

In [None]:
# for series
df['Date'].ndim
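
The three attributes above are related, and a small example makes the relationship visible. This is a minimal sketch on a toy DataFrame whose values are made up for illustration (they are not taken from the DJIA table). Note that `shape`, `size` and `ndim` are attributes, so they are written without parentheses.

```python
import pandas as pd

# Toy data: hypothetical values, not from the DJIA table
toy = pd.DataFrame({"Open": [10.0, 11.0, 12.0], "Close": [10.5, 10.8, 12.3]})

n_rows, n_cols = toy.shape            # shape is a (rows, columns) tuple
print(toy.shape)
print(toy.size == n_rows * n_cols)    # size counts every cell
print(toy.ndim, toy["Open"].ndim)     # DataFrame vs. Series
print(len(toy) == n_rows)             # len() gives the number of rows
```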

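### 5) **df.index**<a id="5"></a>
Returns the index (row labels) of the DataFrame. Because this dataset was loaded without an index column, the result is a RangeIndex that starts at 0 and stops at the number of rows.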

In [None]:
df.index

### 6) **df.columns**<a id="6"></a>
Returns the labels of all the columns in the dataset.

In [None]:
df.columns

### 7) **df.count()**<a id="7"></a>
It is used to count the number of non-NA/null observations along the given axis. It works with non-numeric data as well.

**Syntax:**

DataFrame.count(axis=0, level=None, numeric_only=False)

**Parameters:**
* **axis :** 0 or ‘index’ to count down each column (one result per column), 1 or ‘columns’ to count across each row (one result per row)
* **level :** If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame. This parameter was removed in pandas 2.0.
* **numeric_only :** Include only float, int, boolean data
* **Returns: count :** Series (or DataFrame if level specified)

In [None]:
df.count()
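
To see how `axis` and missing values affect the result, here is a small sketch on a toy DataFrame with one missing value. The column names and numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value (hypothetical values)
toy = pd.DataFrame({"Open": [10.0, np.nan, 12.0], "Volume": [100, 200, 300]})

per_column = toy.count()         # axis=0 (default): one count per column
per_row = toy.count(axis=1)      # axis=1: one count per row

print(per_column)
print(per_row)
```

The missing value lowers the count for `Open` and for the row that contains it, while `Volume` keeps its full count.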

### 8) **df.sum()**<a id="8"></a>
DataFrame.sum() returns the sum of the values along the requested axis. With the index axis, it adds up all the values in each column and returns a Series containing one sum per column. It can also skip the missing values in the DataFrame while calculating the sum.

**Syntax:** 
DataFrame.sum(axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)

**Parameters :**
* **axis :** {index (0), columns (1)}
* **skipna :** Exclude NA/null values when computing the result.
* **level :** If the axis is a MultiIndex (hierarchical), sum along a particular level, collapsing into a Series. This parameter was removed in pandas 2.0.
* **numeric_only :** Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
* **min_count :** The required number of valid values to perform the operation. If fewer than min_count non-NA values are present, the result will be NA.
* **Returns :** sum : Series or DataFrame (if level specified)

In [None]:
df.sum()
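
The `skipna` and `min_count` parameters are easier to understand on a tiny example. The sketch below uses a toy DataFrame with made-up values, including a column that is entirely missing.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): one partly missing and one fully missing column
toy = pd.DataFrame({"Open": [10.0, np.nan, 12.0],
                    "Empty": [np.nan, np.nan, np.nan]})

default_sum = toy.sum()               # NaN is skipped; an all-NaN column sums to 0.0
strict_sum = toy.sum(skipna=False)    # any NaN makes that column's sum NaN
guarded_sum = toy.sum(min_count=1)    # require at least one valid value per column

print(default_sum)
print(strict_sum)
print(guarded_sum)
```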

### 9) **df.cumsum()**<a id="9"></a>
Returns a DataFrame or Series of the same size containing the cumulative sum.

A **cumulative sum** is a sequence of partial sums of a given sequence. For example, the cumulative sums of the sequence {a,b,c,...}, are a, a+b, a+b+c, ...

In [None]:
df.cumsum().head()
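
As a worked instance of the partial sums described above, here is a minimal sketch on a toy Series (the numbers are made up for illustration):

```python
import pandas as pd

# Toy data (hypothetical values)
volume = pd.Series([100, 200, 300, 400], name="Volume")

running_total = volume.cumsum()
print(running_total.tolist())   # [100, 300, 600, 1000]
```

The last element of the cumulative sum equals the plain `sum()` of the Series.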

### 10) **df.min()**<a id="10"></a>
Returns the minimum of the values in the given object.

In [None]:
df.min()

### 11) **df.max()**<a id="11"></a>
Returns the maximum of the values in the given object.

In [None]:
df.max()

### 12) **df.idxmin()**<a id="12"></a>
idxmin() returns the index of the first occurrence of the minimum over the requested axis. While finding the index of the minimum value, all NA/null values are excluded.

**Syntax:** DataFrame.idxmin(axis=0, skipna=True)

**Parameters :**
* **axis :** 0 or ‘index’ to find the minimum of each column (returns row labels), 1 or ‘columns’ to find the minimum of each row (returns column labels)
* **skipna :** Exclude NA/null values. If an entire row/column is NA, the result will be NA

**Returns :** idxmin : Series. When called on a single column, as in the cell below, it returns a single index label.

In [None]:
df['Open'].idxmin()

### 13) **df.idxmax()**<a id="13"></a>
idxmax() returns the index of the first occurrence of the maximum over the requested axis.

In [None]:
df['Open'].idxmax()
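
Both functions return an index label, not the value itself, and the label can be passed to `.loc` to fetch the matching row. The sketch below assumes a toy DataFrame with made-up values and string labels, so the difference between label and value is easy to see.

```python
import pandas as pd

# Toy data (hypothetical values); the minimum 9.5 appears twice
toy = pd.DataFrame({"Open": [12.0, 9.5, 14.0, 9.5]}, index=["a", "b", "c", "d"])

low_label = toy["Open"].idxmin()    # label of the first occurrence of the minimum
high_label = toy["Open"].idxmax()   # label of the maximum
low_row = toy.loc[low_label]        # the full row at that label

print(low_label, high_label)        # b c
print(low_row)
```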

### 14) **df.describe()**<a id="14"></a>
This method provides basic statistical information about the data. By default, only the numeric columns are summarized.

* **count:** number of non-null entries
* **mean:** average of entries
* **std:** standard deviation
* **min:** minimum entry
* **25%:** first quartile
* **50%:** median or second quartile
* **75%:** third quartile
* **max:** maximum entry

In [None]:
df.describe()

### 15) **df.mean()**<a id="15"></a>
Returns the mean of each numeric column.

In [None]:
# numeric_only=True skips the text 'Date' column, which pandas 2.0+ would otherwise reject
df.mean(numeric_only=True)

### 16) **df.median()**<a id="16"></a>
Returns the median of each numeric column.

In [None]:
# numeric_only=True skips the text 'Date' column, which pandas 2.0+ would otherwise reject
df.median(numeric_only=True)

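### 17) **df.quantile()**<a id="17"></a>
df.quantile() returns the values at the given quantiles over the requested axis.

**Note:** Quantiles are the values of a variate that divide a frequency distribution into equal groups, each containing the same fraction of the total population. For example, the 0.25 quantile is the value below which 25% of the observations fall.

In [None]:
# numeric_only=True skips the text 'Date' column, which pandas 2.0+ would otherwise reject
df.quantile([0.25,0.75], numeric_only=True)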
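
A tiny worked example can make quantiles more concrete. This sketch uses a toy Series with made-up values and the default linear interpolation:

```python
import pandas as pd

# Toy data (hypothetical values), already sorted for readability
prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

q = prices.quantile([0.25, 0.5, 0.75])
print(q.tolist())   # [2.0, 3.0, 4.0]
```

The 0.5 quantile is the same number that `median()` returns.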

### 18) **df.var()**<a id="18"></a>
Returns the variance of each numeric column.

In [None]:
# numeric_only=True skips the text 'Date' column, which pandas 2.0+ would otherwise reject
df.var(numeric_only=True)

### 19) **df.std()**<a id="19"></a>
Returns the standard deviation of each numeric column.

In [None]:
# numeric_only=True skips the text 'Date' column, which pandas 2.0+ would otherwise reject
df.std(numeric_only=True)
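
Variance and standard deviation are closely linked: the standard deviation is the square root of the variance. By default pandas computes the sample versions (`ddof=1`, dividing by n - 1). The sketch below checks this on a toy Series with made-up values.

```python
import math
import pandas as pd

# Toy data (hypothetical values): mean is 5, squared deviations sum to 32
prices = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_var = prices.var()              # default ddof=1: 32 / 7
population_var = prices.var(ddof=0)    # ddof=0: 32 / 8
sample_std = prices.std()              # square root of the sample variance

print(sample_var, population_var, sample_std)
```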

### 20) **df.cummax()**<a id="20"></a>
Returns the cumulative maximum: each entry is the largest value seen so far in its column.

In [None]:
df.cummax().head()

### 21) **df.cummin()**<a id="21"></a>
Returns the cumulative minimum: each entry is the smallest value seen so far in its column.

In [None]:
df.cummin().head()
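
The running maximum and minimum are easiest to follow on a short sequence. A minimal sketch on a toy Series with made-up values:

```python
import pandas as pd

# Toy data (hypothetical values)
prices = pd.Series([3, 1, 4, 1, 5, 2])

running_high = prices.cummax()   # largest value seen so far
running_low = prices.cummin()    # smallest value seen so far

print(running_high.tolist())     # [3, 3, 4, 4, 5, 5]
print(running_low.tolist())      # [3, 1, 1, 1, 1, 1]
```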

### 22) **df['columnname'].cumprod()**<a id="22"></a>
Returns the cumulative product of the data: each entry is the product of all values up to and including that row.

In [None]:
df['Open'].cumprod().head()
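
The cumulative product works like the cumulative sum, with multiplication in place of addition. A minimal sketch on a toy Series with made-up values:

```python
import pandas as pd

# Toy data (hypothetical values)
factors = pd.Series([1, 2, 3, 4])

running_product = factors.cumprod()
print(running_product.tolist())   # [1, 2, 6, 24]
```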

### 23) **len(df)**<a id="23"></a>
Returns the number of rows in the dataset.

In [None]:
len(df)

### 24) **df.isnull()**<a id="24"></a>
Checks every cell for null values and returns a DataFrame of booleans with the same shape (True where a value is missing).

In [None]:
df.isnull().head()
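
The boolean result is often more useful when aggregated. The sketch below, on a toy DataFrame with made-up values, counts the missing values per column by summing the mask (True counts as 1).

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one missing value per column
toy = pd.DataFrame({"Open": [10.0, np.nan, 12.0],
                    "Close": [10.5, 10.8, np.nan]})

mask = toy.isnull()                  # same shape as toy, True where missing
missing_per_column = mask.sum()      # number of missing values in each column
any_missing = mask.any().any()       # True if anything is missing at all

print(missing_per_column)
print(any_missing)
```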

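### 25) **df.corr()**<a id="25"></a>
Returns the pairwise correlation between the columns (Pearson correlation by default).

In [None]:
# numeric_only=True skips the text 'Date' column, which pandas 2.0+ would otherwise reject
df.corr(numeric_only=True)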
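
To read a correlation matrix, it helps to look at cases where the answer is known in advance. In this sketch the toy columns are made up so that one rises perfectly with `x` and the other falls perfectly with it.

```python
import pandas as pd

# Toy data (hypothetical values): 'up' = 2 * x, 'down' = 10 - 2 * x
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0],
                    "up": [2.0, 4.0, 6.0, 8.0],
                    "down": [8.0, 6.0, 4.0, 2.0]})

corr_matrix = toy.corr()
print(corr_matrix)
```

Values close to 1 mean two columns move together, values close to -1 mean they move in opposite directions, and every column has a correlation of 1 with itself.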
