In [3]:
import pandas as pd 

In [4]:
nba = pd.read_csv("nba.csv")

#### A Series is a one-dimensional data structure: essentially a single column of data. <br>A DataFrame is a two-dimensional data structure made up of rows and columns. Think of a DataFrame as a table in Excel.<br>Dimensions have nothing to do with the number of rows or columns. The number of dimensions equals the number of references we need to extract any single value from the data structure. With a Series we need only one reference, i.e. a row label or row position, to fetch a value, so a `Series` is a 1-D structure. With a DataFrame we need two, a row and a column, so it is a 2-D structure.
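
#### A minimal sketch of the "number of references" idea above. The player names and values are made-up toy data, not read from nba.csv.

```python
import pandas as pd

# Hypothetical toy data, not taken from nba.csv.
ages = pd.Series([25, 27, 22], index=["Avery", "John", "R.J."])
players = pd.DataFrame(
    {"Age": [25, 27, 22], "Weight": [180, 205, 185]},
    index=["Avery", "John", "R.J."],
)

# A Series needs one reference (a row label) to reach a value.
john_age = ages.loc["John"]

# A DataFrame needs two references (a row label and a column label).
john_weight = players.loc["John", "Weight"]

# ndim reports the number of dimensions of each object.
print(ages.ndim, players.ndim)
```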

## Shared Methods and Attributes

#### The read_csv method: Read a comma-separated values (csv) file into DataFrame.

`Signature:
pd.read_csv(
    filepath_or_buffer: Union[str, pathlib.Path, IO[~AnyStr]],
    sep=',',
    delimiter=None,
    header='infer',
    names=None,
    index_col=None,
    usecols=None,
    squeeze=False,
    prefix=None,
    mangle_dupe_cols=True,
    dtype=None,
    engine=None,
    converters=None,
    true_values=None,
    false_values=None,
    skipinitialspace=False,
    skiprows=None,
    skipfooter=0,
    nrows=None,
    na_values=None,
    keep_default_na=True,
    na_filter=True,
    verbose=False,
    skip_blank_lines=True,
    parse_dates=False,
    infer_datetime_format=False,
    keep_date_col=False,
    date_parser=None,
    dayfirst=False,
    cache_dates=True,
    iterator=False,
    chunksize=None,
    compression='infer',
    thousands=None,
    decimal: str = '.',
    lineterminator=None,
    quotechar='"',
    quoting=0,
    doublequote=True,
    escapechar=None,
    comment=None,
    encoding=None,
    dialect=None,
    error_bad_lines=True,
    warn_bad_lines=True,
    delim_whitespace=False,
    low_memory=True,
    memory_map=False,
    float_precision=None,
)`


In [9]:
nba = pd.read_csv("nba.csv")

In [10]:
nba.head(1) #The default number of rows is 5, but we can specify how many to return. This returns a new DataFrame with just the first n rows of the original.

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0


#### * Notice that pandas has automatically created a numeric index that begins at 0 and counts up to the final record. This index is not part of the csv; it is generated automatically. <br> We can think of the columns as multiple Series objects glued together by a common index.

#### * NaN stands for Not a Number, but pandas uses it to mark missing values. The last row in this dataset is completely empty, which is why every column in it shows NaN.

#### * Another point to remember: if a column holds numeric values (whole or decimal) and has even a single missing value, i.e. NaN, pandas interprets it as floating point. So a column of integers with even one missing value gets a float data type, and a column with the default int data type never contains missing values. In this example Weight and Salary are really integers, but the missing values force them to be read as floats, which is why each value ends in .0.
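
#### The effect of a single missing value on a column's data type can be checked with a small sketch. The CSV text below is hypothetical and is read from memory instead of from nba.csv.

```python
import io
import pandas as pd

# Hypothetical CSV text; in the second one, row B has no Weight value.
complete = pd.read_csv(io.StringIO("Name,Weight\nA,180\nB,190\n"))
with_gap = pd.read_csv(io.StringIO("Name,Weight\nA,180\nB,\nC,205\n"))

print(complete["Weight"].dtype)  # integer column, no missing values
print(with_gap["Weight"].dtype)  # float column, because of the NaN
```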

#### * Some methods are common to both DataFrame and Series. Note that their behaviour may differ slightly.


#### **The tail() method:** Return the last `n` rows. <br>This function returns the last `n` rows from the object based on position. It is useful for quickly verifying data, for example, after sorting or appending rows. For negative values of `n`, this function returns all rows except the first `|n|` rows, equivalent to ``df[|n|:]``.<br>The parameter n is an int and defaults to 5. The return type is the same as the caller (Series or DataFrame).

`
Signature: nba.tail(n: int = 5) -> ~FrameOrSeries`

#### **The head() method:** Return the first `n` rows.<br> This function returns the first `n` rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it. <br> For negative values of `n`, this function returns all rows except the last `|n|` rows, equivalent to ``df[:n]``. <br>The parameter n is an int and defaults to 5. The return type is the same as the caller (Series or DataFrame).

`Signature: nba.head(n: int = 5) -> ~FrameOrSeries`
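
#### A small sketch of how head and tail treat positive and negative `n`. The five-row frame is made-up data used only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

first_two = df.head(2)           # rows 0 and 1
all_but_last_two = df.head(-2)   # rows 0, 1, 2 (same as df[:-2])
last_two = df.tail(2)            # rows 3 and 4
all_but_first_two = df.tail(-2)  # rows 2, 3, 4 (same as df[2:])

# Each call returns a new object; df itself still has five rows.
print(len(df))
```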



In [5]:
nba.tail() #Default number of rows is 5. Series also have .head and .tail methods. Remember that a new dataframe is returned.

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0
457,,,,,,,,,


#### A DataFrame is composed of smaller objects, each with a specialized task. Attributes such as index, values and columns expose these inner objects.

#### RangeIndex: Immutable Index implementing a monotonic integer range.<br>RangeIndex is a memory-saving special case of Int64Index limited to representing monotonic ranges. Using RangeIndex may in some instances improve computing speed.<br> This is the default index type used by DataFrame and Series when no explicit index is provided by the user.

In [6]:
nba.index #index is an attribute

RangeIndex(start=0, stop=458, step=1)

#### The values attribute: Return a NumPy representation of the DataFrame. The pandas documentation recommends using `DataFrame.to_numpy()` instead. Only the values in the DataFrame are returned; the axes labels are removed.


In [14]:
print(type(nba.values))
nba.values #This gives us the underlying ndarray object that is storing the values that make up a Dataframe.
#Pandas stores the values in a ndarray and adds an additional wrapper around it.

<class 'numpy.ndarray'>


array([['Avery Bradley', 'Boston Celtics', 0.0, ..., 180.0, 'Texas',
        7730337.0],
       ['Jae Crowder', 'Boston Celtics', 99.0, ..., 235.0, 'Marquette',
        6796117.0],
       ['John Holland', 'Boston Celtics', 30.0, ..., 205.0,
        'Boston University', nan],
       ...,
       ['Tibor Pleiss', 'Utah Jazz', 21.0, ..., 256.0, nan, 2900000.0],
       ['Jeff Withey', 'Utah Jazz', 24.0, ..., 231.0, 'Kansas', 947276.0],
       [nan, nan, nan, ..., nan, nan, nan]], dtype=object)

#### **The shape attribute**: Return a tuple representing the dimensionality of the DataFrame.

In [12]:
nba.shape

(458, 9)

#### Series also has the dtypes attribute, where it gives the data type of the Series. On a DataFrame it returns a Series whose index labels are the column names of the original DataFrame and whose values are their data types. The data type 'object' normally means a string in pandas lingo.

In [13]:
nba.dtypes

Name         object
Team         object
Number      float64
Position     object
Age         float64
Height       object
Weight      float64
College      object
Salary      float64
dtype: object

#### Using chaining as shown below, we can count how many times each data type appears in a DataFrame. Here nba.dtypes gives us a Series, and on this Series we call the value_counts method.

In [15]:
nba.dtypes.value_counts()

object     5
float64    4
dtype: int64

In [14]:
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0


#### * DataFrames also have a `columns` attribute that holds the column labels of the DataFrame. Series don't have this attribute. Note that the object returned is an Index, so pandas treats both row labels and column labels as indexes. Each column name is a label used to identify a column.

#### **Index object**: Immutable ndarray implementing an ordered, sliceable set. The basic object storing axis labels for all pandas objects.

In [15]:
nba.columns

Index([u'Name', u'Team', u'Number', u'Position', u'Age', u'Height', u'Weight',
       u'College', u'Salary'],
      dtype='object')

#### * The `axes` attribute returns a list representing the axes of the DataFrame. Its only members are the row axis labels and the column axis labels, in that order. A Series also has an `axes` attribute, but its list contains only the row index.

In [16]:
nba.axes #returns both indexes of the dataframe.

[RangeIndex(start=0, stop=458, step=1),
 Index([u'Name', u'Team', u'Number', u'Position', u'Age', u'Height', u'Weight',
        u'College', u'Salary'],
       dtype='object')]

#### * The `info` method prints a concise summary of a DataFrame, including the index dtype and column dtypes, non-null counts and memory usage. Older pandas releases offer it only on DataFrames; `Series.info` was added in pandas 1.4. <br> We can also see the number of columns per data type.

In [19]:
nba.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 458 entries, 0 to 457
Data columns (total 9 columns):
Name        457 non-null object
Team        457 non-null object
Number      457 non-null float64
Position    457 non-null object
Age         457 non-null float64
Height      457 non-null object
Weight      457 non-null float64
College     373 non-null object
Salary      446 non-null float64
dtypes: float64(4), object(5)
memory usage: 32.3+ KB


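##### * The method `get_dtype_counts` on a DataFrame returned counts of unique dtypes in the object. It was deprecated in pandas 0.25 and removed in pandas 1.0, which is why the call below raises an AttributeError. Use `nba.dtypes.value_counts()` instead.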

In [5]:
nba.get_dtype_counts()

AttributeError: 'DataFrame' object has no attribute 'get_dtype_counts'
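
#### On pandas versions where `get_dtype_counts` is unavailable, the same information can be obtained by chaining `dtypes` and `value_counts`. A minimal sketch on a made-up three-column frame:

```python
import pandas as pd

# Hypothetical toy frame: one text column and two float columns.
df = pd.DataFrame({
    "Name": ["A", "B"],
    "Age": [25.0, 27.0],
    "Salary": [1000.0, 2000.0],
})

# df.dtypes is a Series of dtypes; value_counts tallies each distinct dtype.
dtype_counts = df.dtypes.value_counts()
print(dtype_counts)
```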

#### index_col : int, str, sequence of int / str, or False, default ``None``<br> Column(s) to use as the row labels of the ``DataFrame``, either given as string name or column index. If a sequence of int / str is given, a MultiIndex is used.<br> Note: ``index_col=False`` can be used to force pandas to *not* use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.

In [6]:
rev = pd.read_csv("revenue.csv", index_col = "Date")
rev.head(3)

Date,New York,Los Angeles,Miami
1/1/16,985,122,499
1/2/16,738,788,534
1/3/16,14,20,933
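
#### To see what `index_col` changes, the sketch below reads the same text twice, once with the default index and once with "Date" as the index. The CSV text is hypothetical and only mimics the layout of revenue.csv.

```python
import io
import pandas as pd

# Hypothetical data in the same layout as revenue.csv.
csv_text = "Date,New York,Los Angeles\n1/1/16,985,122\n1/2/16,738,788\n"

default_index = pd.read_csv(io.StringIO(csv_text))
date_index = pd.read_csv(io.StringIO(csv_text), index_col="Date")

print(default_index.index)  # generated RangeIndex; Date is an ordinary column
print(date_index.index)     # Date values are now the row labels

# With Date as the index, a row can be looked up by its date label.
ny_on_second = date_index.loc["1/2/16", "New York"]
```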


In [10]:
s = pd.Series([1, 2, 3])
s.sum()

6

#### By default, the sum() method below returns a Series whose index labels are the columns of the original DataFrame and whose values are the sums of each column. Thus the sum is calculated per column.

In [9]:
rev.sum()

New York       5475
Los Angeles    5134
Miami          5641
dtype: int64

#### The axis parameter specifies the direction of the summation. The default is axis=0 or axis='index', which sums the values in each column. The other option is axis=1 or axis='columns', which sums the values in each row: for each row, the values from all numeric columns are added up.

In [8]:
rev.sum(axis = "columns")

Date
1/1/16     1606
1/2/16     2060
1/3/16      967
1/4/16     2519
1/5/16      438
1/6/16     1935
1/7/16     1234
1/8/16     2313
1/9/16     2623
1/10/16     555
dtype: int64

#### The sum() method is available on both Series and DataFrame, but the axis parameter is only meaningful for DataFrames. A Series has a single axis, so it accepts only axis=0.
#### So the same method behaves differently for Series and DataFrames. The axis parameter is also available on many other methods.
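
#### A compact sketch of the two axis choices on a tiny frame. The figures are a made-up subset in the style of the revenue data.

```python
import pandas as pd

# Hypothetical two-row, two-city sample.
rev_sample = pd.DataFrame(
    {"New York": [985, 738], "Miami": [499, 534]},
    index=["1/1/16", "1/2/16"],
)

per_column = rev_sample.sum()                # axis=0: one total per city
per_row = rev_sample.sum(axis="columns")     # axis=1: one total per date

print(per_column)
print(per_row)
```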

## Select One Column from a `DataFrame`

In [40]:
nba = pd.read_csv("nba.csv")
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


#### Below is one way of extracting a single column from a DataFrame. Note that selecting a single column returns a Series object. <br> This approach only works when the column name has no spaces and does not clash with an existing DataFrame attribute or method.

In [11]:
nba.Name

0      Avery Bradley
1        Jae Crowder
2       John Holland
3        R.J. Hunter
4      Jonas Jerebko
           ...      
453     Shelvin Mack
454        Raul Neto
455     Tibor Pleiss
456      Jeff Withey
457              NaN
Name: Name, Length: 458, dtype: object

#### A better approach is the [] syntax. With a Series, we use [] to get individual values. With a DataFrame, we use [] to extract columns, with the column name given in quotes. As mentioned earlier, the return type is a Series. <br> This option works even if the column name has spaces in it.

In [12]:
nba["Name"]

0      Avery Bradley
1        Jae Crowder
2       John Holland
3        R.J. Hunter
4      Jonas Jerebko
           ...      
453     Shelvin Mack
454        Raul Neto
455     Tibor Pleiss
456      Jeff Withey
457              NaN
Name: Name, Length: 458, dtype: object
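
#### The difference between the two access styles shows up as soon as a column name contains a space. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical frame with one simple name and one name containing a space.
rev_sample = pd.DataFrame({"Miami": [499, 534], "New York": [985, 738]})

miami = rev_sample.Miami            # attribute access: name is a valid identifier
new_york = rev_sample["New York"]   # bracket access: works for any label

# Writing rev_sample.New York would be a SyntaxError, so brackets are required here.
print(type(new_york))
```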

## Select Two or More Columns from A `DataFrame`

In [64]:
nba = pd.read_csv("nba.csv")
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


#### To select multiple columns, we use the [] notation but pass a list of the columns we want to extract.<br> The return type is a new DataFrame. <br> The order of the selected columns does not need to match their order in the original DataFrame.

In [70]:
nba[["Team", "Name"]].head(3)
nba[["Number", "College"]]
nba[["Salary", "Team", "Name"]].tail()

Unnamed: 0,Salary,Team,Name
453,2433333.0,Utah Jazz,Shelvin Mack
454,900000.0,Utah Jazz,Raul Neto
455,2900000.0,Utah Jazz,Tibor Pleiss
456,947276.0,Utah Jazz,Jeff Withey
457,,,


In [71]:
select = ["Salary", "Team", "Name"]
nba[select]

Unnamed: 0,Salary,Team,Name
0,7730337.0,Boston Celtics,Avery Bradley
1,6796117.0,Boston Celtics,Jae Crowder
2,,Boston Celtics,John Holland
3,1148640.0,Boston Celtics,R.J. Hunter
4,5000000.0,Boston Celtics,Jonas Jerebko
5,12000000.0,Boston Celtics,Amir Johnson
6,1170960.0,Boston Celtics,Jordan Mickey
7,2165160.0,Boston Celtics,Kelly Olynyk
8,1824360.0,Boston Celtics,Terry Rozier
9,3431040.0,Boston Celtics,Marcus Smart


## Add New Column to `DataFrame`

#### We create a new column in a DataFrame simply by assigning to a column name that does not already exist. If the column name already exists, it is overwritten.<br> The right-hand side of the = can be a Series object, which is aligned with the DataFrame by index, or a scalar value. If a scalar is assigned, every row in the DataFrame gets that same value. <br> The assignment approach adds the column to the end of the DataFrame.

#### The other option for creating a new column is the `insert` method.
#### The insert() method: Insert column into DataFrame at specified location. Raises a ValueError if `column` is already contained in the DataFrame, unless `allow_duplicates` is set to True. Columns at or after the insertion position are pushed to the right. The change is always made in place: there is no inplace parameter and no need to assign the result back.

`Signature: nba.insert(loc, column, value, allow_duplicates=False) -> None
loc : int
    Insertion index. Must verify 0 <= loc <= len(columns).
column : str, number, or hashable object
    Label of the inserted column.
value : int, Series, or array-like
allow_duplicates : bool, optional`

In [13]:
nba = pd.read_csv("nba.csv")
nba.head(3)

nba["Sport"] = "Basketball"
nba.head(3)

nba["League"] = "National Basketball Association"
nba.head(3)

nba = pd.read_csv("nba.csv")
nba.head(3)

nba.insert(3, column = "Sport", value = "Basketball")
nba.head(3)

nba.insert(7, column = "League", value = "National Basketball Association")
Output = None #Prevents output from the Cell.
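
#### The two ways of adding a column can be compared side by side. This is a minimal sketch on a made-up two-column frame; the column values are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame.
df = pd.DataFrame({"Name": ["A", "B"], "Team": ["X", "Y"]})

# Assignment appends the new column at the end.
df["Sport"] = "Basketball"

# insert places the column at position 1, modifies df in place and returns None.
result = df.insert(1, column="League", value="NBA")

# Inserting a label that already exists raises a ValueError.
try:
    df.insert(0, column="Sport", value="Hoops")
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(list(df.columns))
```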

## Broadcasting Operations

#### Broadcasting operations are applied to each element of a Series separately. This is the opposite of operations like sort_values, which apply to the Series as a whole.<br> Broadcast methods are an alternative to using the apply() method to perform the same operation on every element.

In [103]:
nba = pd.read_csv("nba.csv")
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


#### The add() broadcast operation: Get Addition of dataframe and other, element-wise (binary operator `add`). <br> Equivalent to ``dataframe + other``, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, `radd`. <br> Among flexible wrappers (`add`, `sub`, `mul`, `div`, `floordiv`, `mod`, `pow`) to arithmetic operators: `+`, `-`, `*`, `/`, `//`, `%`, `**`.


`Signature: nba.add(other, axis='columns', level=None, fill_value=None)`


`Parameters:
other : scalar, sequence, Series, or DataFrame
    Any single or multiple element data structure, or list-like object.
axis : {0 or 'index', 1 or 'columns'}
    Whether to compare by the index (0 or 'index') or columns (1 or 'columns'). For Series input, axis to match Series index on.
level : int or label
    Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value : float or None, default None
    Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation.
    If data in both corresponding DataFrame locations is missing the result will be missing.`

`Returns:
DataFrame
    Result of the arithmetic operation.`
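
#### The practical difference between the `+` operator and `add` is the `fill_value` parameter. A minimal sketch on a made-up Series with one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical ages; the middle one is missing.
ages = pd.Series([25.0, np.nan, 27.0])

plain = ages + 5                     # the missing value stays NaN
filled = ages.add(5, fill_value=0)   # NaN is replaced by 0 before adding

print(plain.tolist())
print(filled.tolist())
```

#### Whether a fill value makes sense depends on the data; treating a missing age as 0 is only done here to make the effect visible.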

In [14]:
nba["Age"].add(5)
nba["Age"] + 5
#If nba["Age"] has any null values, pandas will not raise an error. The result for those rows is still NaN.

nba["Salary"].sub(5000000)
nba["Salary"] - 5000000

nba["Weight"].mul(0.453592)
nba["Weight in Kilograms"] = nba["Weight"] * 0.453592  #Creates a new Column in the dataframe.

In [115]:
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Weight in Kilograms
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,81.64656
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,106.59412
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,,92.98636


In [119]:
nba["Salary"].div(1000000)
nba["Salary in Millions"] = nba["Salary"] / 1000000

In [120]:
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Weight in Kilograms,Salary in Millions
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0,81.64656,7.730337
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0,106.59412,6.796117
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,,92.98636,


## A Review of the `.value_counts()` Method

In [122]:
nba = pd.read_csv("nba.csv")
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


In [132]:
nba["Team"].value_counts()
nba["Position"].value_counts().head(1)
nba["Weight"].value_counts().tail()
nba["Salary"].value_counts()

947276.0      31
845059.0      18
525093.0      13
981348.0       6
16407500.0     5
4000000.0      5
5000000.0      5
1100602.0      5
8000000.0      5
12000000.0     5
1000000.0      4
7000000.0      4
2814000.0      4
3000000.0      4
19689000.0     4
2500000.0      3
13500000.0     3
2854940.0      3
8500000.0      3
5543725.0      3
1015421.0      3
200600.0       3
14260870.0     2
1140240.0      2
15851950.0     2
55722.0        2
1270964.0      2
13000000.0     2
3425510.0      2
2288205.0      2
              ..
2841960.0      1
17120106.0     1
1142880.0      1
4950000.0      1
306527.0       1
206192.0       1
1200000.0      1
900000.0       1
22875000.0     1
4300000.0      1
250750.0       1
5219169.0      1
1160160.0      1
2357760.0      1
6110034.0      1
18671659.0     1
22359364.0     1
8042895.0      1
5192520.0      1
25000000.0     1
3300000.0      1
1749840.0      1
1724250.0      1
10000000.0     1
1320000.0      1
5103120.0      1
6796117.0      1
4088019.0     

## Drop Rows with Null Values

In [133]:
nba = pd.read_csv("nba.csv")
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


#### The dropna() method: Remove missing values.


`Signature: nba.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)`

`Parameters
axis : {0 or 'index', 1 or 'columns'}, default 0
    Determine if rows or columns which contain missing values are removed.
    * 0, or 'index' : Drop rows which contain missing values.
    * 1, or 'columns' : Drop columns which contain missing value.
    Changed in version 1.0.0: passing a tuple or list to drop on multiple axes is no longer supported. Only a single axis is allowed.
how : {'any', 'all'}, default 'any'
    Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.
    * 'any' : If any NA values are present, drop that row or column.
    * 'all' : If all values are NA, drop that row or column.
thresh : int, optional
    Require that many non-NA values.
subset : array-like, optional
    Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.
inplace : bool, default False
    If True, do operation inplace and return None.`
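
#### The `how`, `subset` and `thresh` parameters are easier to compare on a frame small enough to check by eye. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 is complete, row 3 is entirely missing.
df = pd.DataFrame({
    "Name": ["A", "B", "C", np.nan],
    "College": ["Texas", np.nan, np.nan, np.nan],
    "Salary": [100.0, 200.0, np.nan, np.nan],
})

any_null = df.dropna()                      # drop rows with at least one null
all_null = df.dropna(how="all")             # drop rows where every value is null
salary_only = df.dropna(subset=["Salary"])  # only look at the Salary column
at_least_two = df.dropna(thresh=2)          # keep rows with 2 or more non-null values

print(len(any_null), len(all_null), len(salary_only), len(at_least_two))
```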

In [None]:
nba.dropna()  #Drops all rows that have at least one null value.

In [134]:
nba.tail(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0
457,,,,,,,,,


In [137]:
nba.dropna(how = "all", inplace = True) #drops rows only if all values in the row are null.

In [138]:
nba.tail(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0


In [15]:
nba.dropna(axis=1) #Drops every column that has at least one null value.

Unnamed: 0,Sport,League
0,Basketball,National Basketball Association
1,Basketball,National Basketball Association
2,Basketball,National Basketball Association
3,Basketball,National Basketball Association
4,Basketball,National Basketball Association
...,...,...
453,Basketball,National Basketball Association
454,Basketball,National Basketball Association
455,Basketball,National Basketball Association
456,Basketball,National Basketball Association


In [141]:
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


In [16]:
nba.dropna(subset = ["Salary", "College"]) #We can provide specific list of columns to check for dropping rows.

Unnamed: 0,Name,Team,Number,Sport,Position,Age,Height,League,Weight,College,Salary,Weight in Kilograms
0,Avery Bradley,Boston Celtics,0.0,Basketball,PG,25.0,6-2,National Basketball Association,180.0,Texas,7730337.0,81.646560
1,Jae Crowder,Boston Celtics,99.0,Basketball,SF,25.0,6-6,National Basketball Association,235.0,Marquette,6796117.0,106.594120
3,R.J. Hunter,Boston Celtics,28.0,Basketball,SG,22.0,6-5,National Basketball Association,185.0,Georgia State,1148640.0,83.914520
6,Jordan Mickey,Boston Celtics,55.0,Basketball,PF,21.0,6-8,National Basketball Association,235.0,LSU,1170960.0,106.594120
7,Kelly Olynyk,Boston Celtics,41.0,Basketball,C,25.0,7-0,National Basketball Association,238.0,Gonzaga,2165160.0,107.954896
...,...,...,...,...,...,...,...,...,...,...,...,...
449,Rodney Hood,Utah Jazz,5.0,Basketball,SG,23.0,6-8,National Basketball Association,206.0,Duke,1348440.0,93.439952
451,Chris Johnson,Utah Jazz,23.0,Basketball,SF,26.0,6-6,National Basketball Association,206.0,Dayton,981348.0,93.439952
452,Trey Lyles,Utah Jazz,41.0,Basketball,PF,20.0,6-10,National Basketball Association,234.0,Kentucky,2239800.0,106.140528
453,Shelvin Mack,Utah Jazz,8.0,Basketball,PG,26.0,6-3,National Basketball Association,203.0,Butler,2433333.0,92.079176


## Fill in Null Values with the `.fillna()` Method

In [20]:
nba = pd.read_csv("nba.csv")
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


#### The fillna() method: Fill NA/NaN values using the specified method.

`Signature:
nba.fillna(
    value=None,
    method=None,
    axis=None,
    inplace=False,
    limit=None,
    downcast=None,
) -> Union[ForwardRef('DataFrame'), NoneType]`

`Parameters
value : scalar, dict, Series, or DataFrame
    Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame).  Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.
method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
    Method to use for filling holes in reindexed Series 
    pad / ffill: propagate last valid observation forward to next valid
    backfill / bfill: use next valid observation to fill gap.
axis : {0 or 'index', 1 or 'columns'}
    Axis along which to fill missing values.
inplace : bool, default False
    If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame).
limit : int, default None
    If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
downcast : dict, default is None
    A dict of item->dtype of what to downcast if possible, or the string 'infer' which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).` 
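
#### The `value` parameter also accepts a dict, which lets each column get its own replacement in one call. A minimal sketch on made-up data; the replacement values are illustrative choices.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing College and one missing Salary.
df = pd.DataFrame({
    "College": ["Texas", np.nan],
    "Salary": [100.0, np.nan],
})

# A dict maps column name -> fill value; columns not listed are left alone.
filled = df.fillna({"College": "No College", "Salary": 0})

# fillna returns a new object by default, so df still holds its nulls.
salary_filled = df["Salary"].fillna(0)

print(filled)
```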

In [21]:
nba.fillna(0)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,0,5000000.0
5,Amir Johnson,Boston Celtics,90.0,PF,29.0,6-9,240.0,0,12000000.0
6,Jordan Mickey,Boston Celtics,55.0,PF,21.0,6-8,235.0,LSU,1170960.0
7,Kelly Olynyk,Boston Celtics,41.0,C,25.0,7-0,238.0,Gonzaga,2165160.0
8,Terry Rozier,Boston Celtics,12.0,PG,22.0,6-2,190.0,Louisville,1824360.0
9,Marcus Smart,Boston Celtics,36.0,PG,22.0,6-4,220.0,Oklahoma State,3431040.0


In [23]:
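nba["Salary"] = nba["Salary"].fillna(0) #Assign the result back; inplace=True on a selected column may not update nba in newer pandas versions.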

In [24]:
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0


In [27]:
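nba["College"] = nba["College"].fillna("No College") #Assign the result back; inplace=True on a selected column may not update nba in newer pandas versions.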

In [28]:
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,No College,5000000.0


## The `.astype()` Method

In [17]:
nba = pd.read_csv("nba.csv").dropna(how = "all")
nba["Salary"] = nba["Salary"].fillna(0) #Assign back instead of using inplace=True on a selected column.
nba["College"] = nba["College"].fillna("None")
nba.head(6)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
5,Amir Johnson,Boston Celtics,90.0,PF,29.0,6-9,240.0,,12000000.0


In [18]:
nba.dtypes
nba.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 457 entries, 0 to 456
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      457 non-null    object 
 1   Team      457 non-null    object 
 2   Number    457 non-null    float64
 3   Position  457 non-null    object 
 4   Age       457 non-null    float64
 5   Height    457 non-null    object 
 6   Weight    457 non-null    float64
 7   College   457 non-null    object 
 8   Salary    457 non-null    float64
dtypes: float64(4), object(5)
memory usage: 35.7+ KB


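#### Important: casting to an integer type with astype only works when the column has no nulls, because NaN cannot be represented as an integer. Fill or drop the missing values first.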
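
#### A minimal sketch of the failure and the fix, using a made-up Series with one missing value. Filling with 0 is just one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries; the middle one is missing.
salary = pd.Series([7730337.0, np.nan, 1148640.0])

# Casting directly fails because NaN has no integer representation.
try:
    salary.astype("int64")
    cast_failed = False
except ValueError:  # pandas raises ValueError (or a subclass of it) here
    cast_failed = True

# After the missing value is filled, the cast succeeds.
clean = salary.fillna(0).astype("int64")
print(clean.dtype)
```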

#### The astype() method: Cast a pandas object to a specified dtype ``dtype``.

`Signature: nba.astype(dtype, copy: bool = True, errors: str = 'raise') -> ~FrameOrSeries`
`Parameters
dtype : data type, or dict of column name -> data type
    Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, ...}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame's columns to column-specific types.
copy : bool, default True
    Return a copy when ``copy=True`` (be very careful setting ``copy=False`` as changes to values then may propagate to other pandas objects).
errors : {'raise', 'ignore'}, default 'raise'
    Control raising of exceptions on invalid data for provided dtype.
    - ``raise`` : allow exceptions to be raised
    - ``ignore`` : suppress exceptions. On error return original object.`

In [166]:
nba["Salary"] = nba["Salary"].astype("int")

In [167]:
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0


In [173]:
nba["Number"] = nba["Number"].astype("int")
nba["Age"] = nba["Age"].astype("int")
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0,PG,25,6-2,180.0,Texas,7730337
1,Jae Crowder,Boston Celtics,99,SF,25,6-6,235.0,Marquette,6796117
2,John Holland,Boston Celtics,30,SG,27,6-5,205.0,Boston University,0


In [175]:
nba["Age"].astype("float")

0      25.0
1      25.0
2      27.0
3      22.0
4      29.0
5      29.0
6      21.0
7      25.0
8      22.0
9      22.0
10     24.0
11     27.0
12     27.0
13     20.0
14     26.0
15     27.0
16     24.0
17     28.0
18     21.0
19     32.0
20     22.0
21     26.0
22     23.0
23     28.0
24     21.0
25     26.0
26     25.0
27     26.0
28     28.0
29     27.0
       ... 
427    20.0
428    25.0
429    23.0
430    24.0
431    27.0
432    23.0
433    28.0
434    34.0
435    24.0
436    25.0
437    24.0
438    23.0
439    26.0
440    30.0
441    20.0
442    28.0
443    23.0
444    24.0
445    20.0
446    24.0
447    23.0
448    26.0
449    23.0
450    28.0
451    26.0
452    20.0
453    26.0
454    24.0
455    26.0
456    26.0
Name: Age, dtype: float64

In [178]:
nba["Position"].nunique()

5

#### "category" is a pandas dtype suited to columns with only a few unique values, such as Position (5 unique values). Instead of storing the full value in every row, pandas stores each unique value once and keeps a small integer code per row that refers to it. This reduces memory usage.

In [181]:
nba["Position"] = nba["Position"].astype("category")

In [186]:
nba["Team"] = nba["Team"].astype("category")

In [187]:
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0,PG,25,6-2,180.0,Texas,7730337
1,Jae Crowder,Boston Celtics,99,SF,25,6-6,235.0,Marquette,6796117
2,John Holland,Boston Celtics,30,SG,27,6-5,205.0,Boston University,0
3,R.J. Hunter,Boston Celtics,28,SG,22,6-5,185.0,Georgia State,1148640
4,Jonas Jerebko,Boston Celtics,8,PF,29,6-10,231.0,,5000000
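
The memory saving from the category dtype can be measured with `memory_usage(deep=True)`. The sketch below uses a synthetic column of repeated position codes rather than the real `nba` data, so the exact byte counts will differ from the notebook's.

```python
import pandas as pd

# Synthetic column: 5 unique values repeated many times (illustrative only).
positions = pd.Series(["PG", "SF", "SG", "PF", "C"] * 1000)

as_strings = positions.memory_usage(deep=True)
cat = positions.astype("category")
as_category = cat.memory_usage(deep=True)

print(as_strings, as_category)       # the category version is much smaller
print(list(cat.cat.categories))      # each unique value is stored once
print(cat.cat.codes.head().tolist()) # one small integer code per row
```

The benefit shrinks as the number of unique values approaches the number of rows, which is why `nunique()` is worth checking first.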


## Sort a `DataFrame` with the `.sort_values()` Method, Part I

In [2]:
nba = pd.read_csv("nba.csv")
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


#### The sort_values() method on a DataFrame: Sort by the values along either axis.

`Signature:
nba.sort_values(
    by,
    axis=0,
    ascending=True,
    inplace=False,
    kind='quicksort',
    na_position='last',
    ignore_index=False,
)`

`Parameters
        by : str or list of str
            Name or list of names to sort by.
            - if `axis` is 0 or `'index'` then `by` may contain index
              levels and/or column labels.
            - if `axis` is 1 or `'columns'` then `by` may contain column
              levels and/or index labels.
            .. versionchanged:: 0.23.0
               Allow specifying index or column level names.
axis : {0 or 'index', 1 or 'columns'}, default 0
     Axis to be sorted.
ascending : bool or list of bool, default True
     Sort ascending vs. descending. Specify list for multiple sort orders.  If this is a list of bools, must match the length of the by.
inplace : bool, default False
     If True, perform operation in-place.
kind : {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'
     Choice of sorting algorithm. See also ndarray.np.sort for more information.  `mergesort` is the only stable algorithm. For DataFrames, this option is only applied when sorting on a single column or label.
na_position : {'first', 'last'}, default 'last'
     Puts NaNs at the beginning if `first`; `last` puts NaNs at the end.
ignore_index : bool, default False
     If True, the resulting axis will be labeled 0, 1, …, n - 1. `

In [9]:
nba.sort_values("Name", ascending = False)

nba.sort_values("Age", ascending = False)

nba.sort_values("Salary", ascending = False, inplace = True)
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
109,Kobe Bryant,Los Angeles Lakers,24.0,SF,37.0,6-6,212.0,,25000000.0
169,LeBron James,Cleveland Cavaliers,23.0,SF,31.0,6-8,250.0,,22970500.0
33,Carmelo Anthony,New York Knicks,7.0,SF,32.0,6-8,240.0,Syracuse,22875000.0


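#### By default (na_position = "last"), NaN values are placed at the end of the result, whether the sort is ascending or descending. They are not treated as larger or smaller than the other values; pass na_position = "first" to move them to the top.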

In [15]:
nba.sort_values("Salary", ascending = False, na_position = "first").tail()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
175,Jordan McRae,Cleveland Cavaliers,12.0,SG,25.0,6-5,179.0,Tennessee,111196.0
135,Alan Williams,Phoenix Suns,15.0,C,23.0,6-8,260.0,UC Santa Barbara,83397.0
291,Orlando Johnson,New Orleans Pelicans,0.0,SG,27.0,6-5,220.0,UC Santa Barbara,55722.0
130,Phil Pressey,Phoenix Suns,25.0,PG,25.0,5-11,175.0,Missouri,55722.0
32,Thanasis Antetokounmpo,New York Knicks,43.0,SF,23.0,6-7,205.0,,30888.0
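
How `na_position` interacts with `ascending` is easier to see on a tiny frame. A minimal sketch with made-up names and salaries:

```python
import numpy as np
import pandas as pd

# Three illustrative rows; player "B" has a missing salary.
df = pd.DataFrame({"Name": ["A", "B", "C"],
                   "Salary": [100.0, np.nan, 300.0]})

asc = df.sort_values("Salary")                    # 100, 300, NaN
desc = df.sort_values("Salary", ascending=False)  # 300, 100, NaN
first = df.sort_values("Salary", ascending=False,
                       na_position="first")       # NaN, 300, 100

print(asc["Name"].tolist(), desc["Name"].tolist(), first["Name"].tolist())
```

In both default cases the NaN row stays last; only `na_position` moves it.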


## Sort a `DataFrame` with the `.sort_values()` Method, Part II

In [16]:
nba = pd.read_csv("nba.csv")
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


In [20]:
nba.sort_values(["Team", "Name"], ascending = [True, False], inplace = True)
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
322,Walter Tavares,Atlanta Hawks,22.0,C,24.0,7-3,260.0,,1000000.0
310,Tim Hardaway Jr.,Atlanta Hawks,10.0,SG,24.0,6-6,205.0,Michigan,1304520.0
321,Tiago Splitter,Atlanta Hawks,11.0,C,31.0,6-11,245.0,,9756250.0
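
When `by` is a list, `ascending` can be a list of the same length, giving one sort direction per column. A short sketch on a hypothetical four-row frame:

```python
import pandas as pd

# Illustrative rows only; not taken from nba.csv.
df = pd.DataFrame({
    "Team": ["Hawks", "Celtics", "Hawks", "Celtics"],
    "Name": ["Tim", "Avery", "Walter", "Jae"],
})

# Teams A to Z, and within each team names Z to A.
out = df.sort_values(["Team", "Name"], ascending=[True, False])
print(out)

# ignore_index=True relabels the result 0..n-1 instead of keeping old labels.
relabeled = df.sort_values(["Team", "Name"], ascending=[True, False],
                           ignore_index=True)
```

The first column in `by` takes priority; later columns only break ties among rows that share the earlier values.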


## Sort `DataFrame` with the `.sort_index()` Method

In [21]:
nba = pd.read_csv("nba.csv")
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


In [23]:
nba.sort_values(["Number", "Salary", "Name"], inplace = True)
nba.tail(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
68,Lucas Nogueira,Toronto Raptors,92.0,C,23.0,7-0,220.0,,1842000.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
457,,,,,,,,,


#### The sort_index() method on a DataFrame: Sort object by labels (along an axis).

`Signature:
nba.sort_index(
    axis=0,
    level=None,
    ascending=True,
    inplace=False,
    kind='quicksort',
    na_position='last',
    sort_remaining=True,
    ignore_index: bool = False,
)`

`axis : {0 or 'index', 1 or 'columns'}, default 0
    The axis along which to sort.  The value 0 identifies the rows, and 1 identifies the columns.
level : int or level name or list of ints or list of level names
    If not None, sort on values in specified index level(s).
ascending : bool, default True
    Sort ascending vs. descending.
inplace : bool, default False
    If True, perform operation in-place.
kind : {'quicksort', 'mergesort', 'heapsort'}, default 'quicksort'
    Choice of sorting algorithm. See also ndarray.np.sort for more information.  `mergesort` is the only stable algorithm. For DataFrames, this option is only applied when sorting on a single column or label.
na_position : {'first', 'last'}, default 'last'
    Puts NaNs at the beginning if `first`; `last` puts NaNs at the end. Not implemented for MultiIndex.
sort_remaining : bool, default True
    If True and sorting by level and index is multilevel, sort by other levels too (in order) after sorting by specified level.
ignore_index : bool, default False
    If True, the resulting axis will be labeled 0, 1, …, n - 1.`

In [26]:
nba.sort_index(ascending = False, inplace = True)

In [27]:
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
457,,,,,,,,,
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
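
A common use of `sort_index()` is to undo an earlier `sort_values()` and get back to the original row order. The sketch below assumes a small made-up frame with the default integer index:

```python
import pandas as pd

# Illustrative rows with the default 0, 1, 2 index.
df = pd.DataFrame({"Name": ["Avery", "Jae", "John"],
                   "Number": [0.0, 99.0, 30.0]})

shuffled = df.sort_values("Number", ascending=False)  # index order: 1, 2, 0
restored = shuffled.sort_index()                      # index order: 0, 1, 2
reverse = shuffled.sort_index(ascending=False)        # index order: 2, 1, 0

print(shuffled.index.tolist(), restored.index.tolist(), reverse.index.tolist())
```

This only works because the sort kept the original index labels; a sort with `ignore_index=True` would have discarded them.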


## Rank Values with the `.rank()` Method

In [28]:
nba = pd.read_csv("nba.csv").dropna(how = "all")
nba["Salary"] = nba["Salary"].fillna(0).astype("int")
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0


#### The rank() method on a DataFrame: Compute numerical data ranks (1 through n) along axis. <br> By default, equal values are assigned a rank that is the average of the ranks of those values. Note that rank() can be called on either a DataFrame or a Series.
 
`Signature:
nba.rank(
    axis=0,
    method: str = 'average',
    numeric_only: Union[bool, NoneType] = None,
    na_option: str = 'keep',
    ascending: bool = True,
    pct: bool = False,
) -> ~FrameOrSeries`


`Parameters
axis : {0 or 'index', 1 or 'columns'}, default 0
    Index to direct ranking.
method : {'average', 'min', 'max', 'first', 'dense'}, default 'average'
    How to rank the group of records that have the same value (i.e. ties):
    * average: average rank of the group
    * min: lowest rank in the group
    * max: highest rank in the group
    * first: ranks assigned in order they appear in the array
    * dense: like 'min', but rank always increases by 1 between groups.
numeric_only : bool, optional
    For DataFrame objects, rank only numeric columns if set to True.
na_option : {'keep', 'top', 'bottom'}, default 'keep'
    How to rank NaN values:
    * keep: assign NaN rank to NaN values
    * top: assign smallest rank to NaN values if ascending
    * bottom: assign highest rank to NaN values if ascending.
ascending : bool, default True
    Whether or not the elements should be ranked in ascending order.
pct : bool, default False
    Whether or not to display the returned rankings in percentile form.`
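
The `method` options differ only in how ties are handled, which a four-value Series with one tie makes visible. The numbers below are invented for illustration:

```python
import pandas as pd

# Two of the four values are tied.
s = pd.Series([300, 200, 200, 100])

average = s.rank(ascending=False)                # 1.0, 2.5, 2.5, 4.0
lowest = s.rank(ascending=False, method="min")   # 1.0, 2.0, 2.0, 4.0
dense = s.rank(ascending=False, method="dense")  # 1.0, 2.0, 2.0, 3.0
first = s.rank(ascending=False, method="first")  # 1.0, 2.0, 3.0, 4.0

# Casting average ranks to int truncates the .5 produced by ties.
truncated = average.astype("int")                # 1, 2, 2, 4
print(average.tolist(), truncated.tolist())
```

Because `rank()` returns floats by default, casting to `int` after an `average` ranking silently rounds tied ranks down; `method="min"` gives whole-number ranks without that side effect.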

In [33]:
nba["Salary Rank"] = nba["Salary"].rank(ascending = False).astype("int")
nba.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Salary Rank
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337,97
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117,110
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,0,452


In [35]:
nba.sort_values(by = "Salary", ascending = False)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary,Salary Rank
109,Kobe Bryant,Los Angeles Lakers,24.0,SF,37.0,6-6,212.0,,25000000,1
169,LeBron James,Cleveland Cavaliers,23.0,SF,31.0,6-8,250.0,,22970500,2
33,Carmelo Anthony,New York Knicks,7.0,SF,32.0,6-8,240.0,Syracuse,22875000,3
251,Dwight Howard,Houston Rockets,12.0,C,30.0,6-11,265.0,,22359364,4
339,Chris Bosh,Miami Heat,1.0,PF,32.0,6-11,235.0,Georgia Tech,22192730,5
100,Chris Paul,Los Angeles Clippers,3.0,PG,31.0,6-0,175.0,Wake Forest,21468695,6
414,Kevin Durant,Oklahoma City Thunder,35.0,SF,27.0,6-9,240.0,Texas,20158622,7
164,Derrick Rose,Chicago Bulls,1.0,PG,27.0,6-3,190.0,Memphis,20093064,8
349,Dwyane Wade,Miami Heat,3.0,SG,34.0,6-4,220.0,Marquette,20000000,9
174,Kevin Love,Cleveland Cavaliers,0.0,PF,27.0,6-10,251.0,UCLA,19689000,11


In [21]:
nba.rank().sort_values("Age")

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
226,361.0,251.5,284.5,406.5,1.5,248.5,125.0,404.5,183.0
122,111.0,360.0,30.5,406.5,1.5,248.5,152.0,152.5,201.0
40,262.0,300.5,123.5,128.5,12.0,455.5,333.5,256.5,291.0
401,442.0,266.5,30.5,224.5,12.0,125.5,85.0,70.5,153.0
427,78.0,375.0,395.0,128.5,12.0,336.0,333.5,131.5,43.0
...,...,...,...,...,...,...,...,...,...
261,445.0,219.5,251.5,406.5,453.5,248.5,220.0,307.5,290.0
102,344.0,188.0,170.5,224.5,453.5,150.0,36.5,256.5,107.0
298,419.0,405.0,298.0,39.5,456.0,97.5,390.5,431.5,323.0
400,251.0,266.5,298.0,128.5,456.0,97.5,333.5,256.5,375.0
