# Libraries

In [1]:
import pandas as pd
import requests
from io import StringIO

### Creating a DataFrame

In [2]:
df= pd.read_csv('../Dataset/placement.csv')
df.head(5)

Unnamed: 0.1,Unnamed: 0,cgpa,iq,placement
0,0,6.8,123.0,1
1,1,5.9,106.0,0
2,2,5.3,121.0,0
3,3,7.4,132.0,1
4,4,5.8,142.0,0


The dataset provides information on the placement status of students, as well as their CGPA and IQ scores. Each entry represents a student's record, which includes their identification, CGPA, IQ, and whether they were placed in a job after completing their education. The placement column indicates whether the student was successfully placed (1) or not placed (0) in a job. The data can be used to analyze patterns and correlations between academic performance (CGPA), cognitive ability (IQ), and placement outcomes.

### `df` is the DataFrame holding the dataset.
- `df.head(5)`: shows the first 5 rows of the dataset.
- `df.tail(5)`: shows the last 5 rows of the dataset.
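
As a minimal sketch of both calls, the toy frame below uses made-up values (it is not read from `placement.csv`) so the selected rows are easy to see.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate head() and tail()
toy = pd.DataFrame({
    "cgpa": [6.8, 5.9, 5.3, 7.4, 5.8, 7.1, 5.7],
    "placement": [1, 0, 0, 1, 0, 1, 0],
})

first_rows = toy.head(5)  # rows at positions 0..4
last_rows = toy.tail(5)   # rows at positions 2..6

print(first_rows.index.tolist())
print(last_rows.index.tolist())
```

Both methods default to 5 rows when called without an argument.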

## Accessing a CSV via URL

In [3]:
!pip install requests



In [4]:
URL = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv" 
HEADER = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0"}

req = requests.get(url=URL, headers=HEADER)
data = StringIO(req.text)
df_url = pd.read_csv(data)

The dataset provides information about various countries along with their respective regions. Each entry lists a country and the region it belongs to. The regions are categorized by continents such as Africa, South America, and so on. This dataset can be used to analyze the geographic distribution of countries and their regional classifications.
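
The `StringIO` step in the cell above can be tried without network access. The sketch below assumes a small hypothetical string standing in for `req.text`; the column names and rows are illustrative only.

```python
from io import StringIO

import pandas as pd

# Stand-in for req.text: a hypothetical two-column CSV held in a string
csv_text = "Country,Region\nAlgeria,AFRICA\nBrazil,SOUTH AMERICA\n"

# read_csv expects a path, URL, or file-like object; StringIO wraps the
# string so it can be read like an open file
df_text = pd.read_csv(StringIO(csv_text))

print(df_text.shape)
print(df_text.columns.tolist())
```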

### Separator Parameter

- The **separator parameter** (`sep`) tells pandas which character separates the values in each row of the file. Common delimiters include:
  - **Comma (`,`)**: The most common delimiter in CSV files (standard for many applications like Excel).
  - **Semicolon (`;`)**: Often used in European countries or in situations where commas are already part of the data.
  - **Tab (`\t`)**: A tab character, often used in **TSV (Tab-Separated Values)** files.
  - **Pipe (`|`)**: Sometimes used in data systems where commas or semicolons might conflict with the data.

This parameter is crucial when reading or writing CSV files, as it tells the program how to correctly split each row into individual values.

For example:
- A CSV with comma-separated values: `"name, age, location"`
- A CSV with semicolon-separated values: `"name; age; location"`
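
To see the effect of `sep` in isolation, here is a small sketch on hypothetical inline data (the names and values are invented for illustration). The same records are parsed with two delimiters, and once more with the wrong one.

```python
from io import StringIO

import pandas as pd

# Hypothetical inline data; same records, two different delimiters
semicolon_text = "name;age;location\nAsha;31;Pune\nLuca;27;Turin\n"
pipe_text = "name|age|location\nAsha|31|Pune\nLuca|27|Turin\n"

df_semicolon = pd.read_csv(StringIO(semicolon_text), sep=";")
df_pipe = pd.read_csv(StringIO(pipe_text), sep="|")

# With the default sep=",", each whole line lands in a single column
df_wrong = pd.read_csv(StringIO(semicolon_text))

print(df_semicolon.shape, df_pipe.shape, df_wrong.shape)
```

A frame with a single, oddly named column is usually the first sign that the delimiter was guessed wrong.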

In [5]:
df_tsv= pd.read_csv('../Dataset/movie_titles_metadata.tsv', sep='\t')
df_tsv.head(5)

Unnamed: 0,m0,10 things i hate about you,1999,6.90,62847,['comedy' 'romance']
0,m1,1492: conquest of paradise,1992,6.2,10421.0,['adventure' 'biography' 'drama' 'history']
1,m2,15 minutes,2001,6.1,25854.0,['action' 'crime' 'drama' 'thriller']
2,m3,2001: a space odyssey,1968,8.4,163227.0,['adventure' 'mystery' 'sci-fi']
3,m4,48 hrs.,1982,6.9,22289.0,['action' 'comedy' 'crime' 'drama' 'thriller']
4,m5,the fifth element,1997,7.5,133756.0,['action' 'adventure' 'romance' 'sci-fi' 'thri...


In [6]:
df_tsv.columns

Index(['m0', '10 things i hate about you', '1999', '6.90', '62847',
       '['comedy' 'romance']'],
      dtype='object')

Here the file has no header row, so pandas used the first record as the column names. We can supply the headers by passing the `names` argument.

In [7]:
df_tsv= pd.read_csv('../Dataset/movie_titles_metadata.tsv', sep='\t', names=['sno','name','release_year','rating','votes','genres'])
df_tsv.head(5)

Unnamed: 0,sno,name,release_year,rating,votes,genres
0,m0,10 things i hate about you,1999,6.9,62847.0,['comedy' 'romance']
1,m1,1492: conquest of paradise,1992,6.2,10421.0,['adventure' 'biography' 'drama' 'history']
2,m2,15 minutes,2001,6.1,25854.0,['action' 'crime' 'drama' 'thriller']
3,m3,2001: a space odyssey,1968,8.4,163227.0,['adventure' 'mystery' 'sci-fi']
4,m4,48 hrs.,1982,6.9,22289.0,['action' 'comedy' 'crime' 'drama' 'thriller']


## `index_col` Parameter
- `index_col='Column Name'`
- Use this parameter to make a particular column the index of the DataFrame.

In [8]:
df_idx_col=pd.read_csv('../Dataset/aug_train.csv')
df_idx_col.head(5)

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0


This dataset contains information about individuals, likely used for job prediction or talent analytics. Each row represents a person's profile, with the following columns:

- **enrollee_id**: Unique identifier for each individual.
- **city**: Encoded city of residence (e.g., `city_103`).
- **city_development_index**: A score (between 0 and 1) indicating how developed the city is.
- **gender**: Gender of the individual (e.g., Male, Female, Other).
- **relevent_experience**: Indicates if the person has relevant experience in their field.
- **enrolled_university**: Enrollment status in a university (e.g., no_enrollment, Full time course).
- **education_level**: Highest level of education attained (e.g., Graduate, Masters).
- **major_discipline**: Field of study (e.g., STEM, Arts).
- **experience**: Total years of work experience.
- **company_size**: Size of the most recent company worked at.
- **company_type**: Type of the most recent employer (e.g., Pvt Ltd, NGO).
- **last_new_job**: Time since the last job change (e.g., never, 1, >4).
- **training_hours**: Number of hours of training completed.
- **target**: Target variable indicating if the person is looking for a new job (likely binary: 1 = looking, 0 = not looking).


In [9]:
df_idx_col=pd.read_csv('../Dataset/aug_train.csv', index_col='enrollee_id')
df_idx_col.head(5)

enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0


Now `enrollee_id` is the index of the DataFrame.
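
One practical consequence is that rows can be looked up by their id with `.loc`. The sketch below uses a three-row inline extract with a reduced, assumed set of columns rather than the full file.

```python
from io import StringIO

import pandas as pd

# Small extract shaped like the output above (only three columns kept)
csv_text = (
    "enrollee_id,city,training_hours\n"
    "8949,city_103,36\n"
    "29725,city_40,47\n"
    "11561,city_21,83\n"
)

df_indexed = pd.read_csv(StringIO(csv_text), index_col="enrollee_id")

# Rows can now be looked up by enrollee_id instead of by position
row = df_indexed.loc[29725]
print(row["city"], row["training_hours"])
```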

## Reading particular columns from a CSV
- `usecols=['Column_1','Column_2','Column_3']`
- The `usecols` parameter in Pandas is used with functions like `read_csv()` to **load only specific columns** from a file.
- Because the other columns are never loaded, this reduces memory usage and can speed up reading.

In [10]:
df_use_cols= pd.read_csv('../Dataset/aug_train.csv', usecols=['city', 'gender'])
df_use_cols.head(5)

Unnamed: 0,city,gender
0,city_103,Male
1,city_40,Male
2,city_21,
3,city_115,
4,city_162,Male
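
Besides a list of names, `usecols` also accepts a callable that is evaluated on each column name. A minimal sketch on an inline extract (the selection rule is only an illustration):

```python
from io import StringIO

import pandas as pd

csv_text = (
    "enrollee_id,city,city_development_index,gender\n"
    "8949,city_103,0.92,Male\n"
    "29725,city_40,0.776,Male\n"
)

# Columns for which the callable returns True are kept
df_city_cols = pd.read_csv(
    StringIO(csv_text),
    usecols=lambda name: name.startswith("city"),
)

print(df_city_cols.columns.tolist())
```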


## `squeeze` Parameter in Pandas

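`squeeze` converts a DataFrame with only one column or row into a **Series**. Older pandas releases accepted it as a `read_csv` parameter (removed in pandas 2.0); the cells below call it as the `DataFrame.squeeze()` method instead.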

### Syntax:
```python
df.squeeze('columns')
```

In [11]:
df_squeeze=pd.read_csv('../Dataset/aug_train.csv', usecols=['city'])
df_squeeze.squeeze('columns')

0        city_103
1         city_40
2         city_21
3        city_115
4        city_162
           ...   
19153    city_173
19154    city_103
19155    city_103
19156     city_65
19157     city_67
Name: city, Length: 19158, dtype: object
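
The change of type is easier to see on a tiny example. This sketch uses three city values from the output above as inline data.

```python
from io import StringIO

import pandas as pd

csv_text = "city\ncity_103\ncity_40\ncity_21\n"

one_col = pd.read_csv(StringIO(csv_text), usecols=["city"])
as_series = one_col.squeeze("columns")

print(type(one_col).__name__)    # DataFrame
print(type(as_series).__name__)  # Series
```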

# Dealing with bad lines in data
## Skiprows/nrows Parameter

When reading data (e.g. from a CSV file) using programming tools like Python's `pandas`, you can **skip rows** that are not needed—such as headers, footers, or metadata—using parameters like `skiprows`.

**Example in Python (pandas):**
```python
import pandas as pd

# Skip the first 2 rows of the file
df = pd.read_csv('data.csv', skiprows=2)
```

In [12]:
df=pd.read_csv('../Dataset/aug_train.csv')
df.head(5)

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0


In [13]:
df_skips = pd.read_csv('../Dataset/aug_train.csv', skiprows=3)
df_skips.head(5)

Unnamed: 0,11561,city_21,0.624,Unnamed: 3,No relevent experience,Full time course,Graduate,STEM,5,Unnamed: 9,Unnamed: 10,never,83,0.0
0,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
1,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0
2,21651,city_176,0.764,,Has relevent experience,Part time course,Graduate,STEM,11,,,1,24,1.0
3,28806,city_160,0.92,Male,Has relevent experience,no_enrollment,High School,,5,50-99,Funded Startup,1,24,0.0
4,402,city_46,0.762,Male,Has relevent experience,no_enrollment,Graduate,STEM,13,<10,Pvt Ltd,>4,18,1.0


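`skiprows=3` skips the first 3 lines of the file. The header line is one of them, so the next line (the record for enrollee `11561`) is used as the column names.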
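
If the header should survive, `skiprows` can also take a list of line numbers instead of a count. A minimal sketch on an inline extract of the same data:

```python
from io import StringIO

import pandas as pd

csv_text = (
    "enrollee_id,city\n"
    "8949,city_103\n"
    "29725,city_40\n"
    "11561,city_21\n"
    "33241,city_115\n"
)

# Line 0 is the header; skipping lines 1 and 2 drops the first two records
df_keep_header = pd.read_csv(StringIO(csv_text), skiprows=[1, 2])

print(df_keep_header.columns.tolist())
print(df_keep_header["enrollee_id"].tolist())
```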

In [14]:
df_n_rows = pd.read_csv('../Dataset/aug_train.csv', nrows=3)
print(df_n_rows.shape)
df_n_rows.head(5)

(3, 14)


Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0


`nrows=3` reads only the first 3 data rows (the header line is not counted).

## Skipping Bad Lines

When reading data files (like CSVs), some lines may be malformed or contain errors. To prevent these lines from causing the program to crash, you can **skip bad lines** during the import process.

**Example in Python (pandas):**
```python
import pandas as pd

# Skip lines with too many fields or parsing errors
df = pd.read_csv('data.csv', on_bad_lines='skip')
```

In [15]:
df=pd.read_csv('../Dataset/BX-Books.csv', sep=';', encoding='latin-1') 
df.head(5)

ParserError: Error tokenizing data. C error: Expected 8 fields in line 6451, saw 9


The read fails with a `ParserError` because line 6451 contains 9 fields where 8 were expected.

Dataset link: https://www.kaggle.com/datasets/alizaynoor/bx-books-csv?select=BX-Books.csv

In [16]:
df_bad_lines=pd.read_csv('../Dataset/BX-Books.csv', sep=';',encoding="latin-1", on_bad_lines='skip') 
df_bad_lines.head()


Unnamed: 0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg
0,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
1,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
2,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
3,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...
4,399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...


With `on_bad_lines='skip'`, the malformed lines in the dataset are skipped and the rest of the file loads.
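
The same behaviour can be reproduced on a few lines of hypothetical data in which one record has an extra field. This is only a sketch; the column names and values are invented, and `on_bad_lines` requires pandas 1.3 or newer.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: the second record has one field too many
csv_text = (
    "isbn;title;year\n"
    "111;First Book;2001\n"
    "222;Second;Book;1999\n"
    "333;Third Book;2005\n"
)

try:
    pd.read_csv(StringIO(csv_text), sep=";")
    raised = False
except pd.errors.ParserError as err:
    raised = True
    print("ParserError:", err)

# on_bad_lines='skip' drops the malformed record without a message
df_clean = pd.read_csv(StringIO(csv_text), sep=";", on_bad_lines="skip")
print(df_clean["isbn"].tolist())
```

Passing `on_bad_lines='warn'` instead skips the same lines but reports each one, which helps when you want to know how much data was dropped.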

### Changing the datatype

In [17]:
df= pd.read_csv('../Dataset/aug_train.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             19158 non-null  int64  
 1   city                    19158 non-null  object 
 2   city_development_index  19158 non-null  float64
 3   gender                  14650 non-null  object 
 4   relevent_experience     19158 non-null  object 
 5   enrolled_university     18772 non-null  object 
 6   education_level         18698 non-null  object 
 7   major_discipline        16345 non-null  object 
 8   experience              19093 non-null  object 
 9   company_size            13220 non-null  object 
 10  company_type            13018 non-null  object 
 11  last_new_job            18735 non-null  object 
 12  training_hours          19158 non-null  int64  
 13  target                  19158 non-null  float64
dtypes: float64(2), int64(2), object(10)

The `df.info()` method in pandas provides a quick summary of a DataFrame. It displays the total number of entries, the number of non-null values in each column, the data type of each column, and the memory usage. This information is useful for understanding the structure of the dataset and identifying missing values or data type issues early in the data analysis process.


In [18]:
df_dtypes = pd.read_csv('../Dataset/aug_train.csv', dtype={'target': int})  # change the datatype of the 'target' column from float64 to int
df_dtypes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             19158 non-null  int64  
 1   city                    19158 non-null  object 
 2   city_development_index  19158 non-null  float64
 3   gender                  14650 non-null  object 
 4   relevent_experience     19158 non-null  object 
 5   enrolled_university     18772 non-null  object 
 6   education_level         18698 non-null  object 
 7   major_discipline        16345 non-null  object 
 8   experience              19093 non-null  object 
 9   company_size            13220 non-null  object 
 10  company_type            13018 non-null  object 
 11  last_new_job            18735 non-null  object 
 12  training_hours          19158 non-null  int64  
 13  target                  19158 non-null  int64  
dtypes: float64(1), int64(3), object(10)

The `dtype` of the `target` column is now `int64`.
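
`dtype` is also useful for keeping values as text. In the BX-Books output earlier, the ISBNs were parsed as numbers and lost their leading zeros; the sketch below shows the difference on a two-row inline extract (the column names `isbn` and `year` are assumed for illustration).

```python
from io import StringIO

import pandas as pd

# Two ISBNs and years from the BX-Books output; column names are assumed
csv_text = "isbn;year\n0002005018;2001\n0060973129;1991\n"

df_default = pd.read_csv(StringIO(csv_text), sep=";")
df_typed = pd.read_csv(
    StringIO(csv_text),
    sep=";",
    dtype={"isbn": str, "year": "int32"},
)

print(df_default["isbn"].tolist())  # parsed as integers, zeros dropped
print(df_typed["isbn"].tolist())    # kept as text
```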

## Handling the date

By default, the `date` column in a CSV dataset is read as the `object` datatype.

In [19]:
df_dates = pd.read_csv('../Dataset/IPL Matches 2008-2020.csv')
df_dates.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 816 entries, 0 to 815
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               816 non-null    int64  
 1   city             803 non-null    object 
 2   date             816 non-null    object 
 3   player_of_match  812 non-null    object 
 4   venue            816 non-null    object 
 5   neutral_venue    816 non-null    int64  
 6   team1            816 non-null    object 
 7   team2            816 non-null    object 
 8   toss_winner      816 non-null    object 
 9   toss_decision    816 non-null    object 
 10  winner           812 non-null    object 
 11  result           812 non-null    object 
 12  result_margin    799 non-null    float64
 13  eliminator       812 non-null    object 
 14  method           19 non-null     object 
 15  umpire1          816 non-null    object 
 16  umpire2          816 non-null    object 
dtypes: float64(1), i

The datatype of the `date` column is `object`.

In [20]:
df_dates = pd.read_csv('../Dataset/IPL Matches 2008-2020.csv', parse_dates=['date'])
df_dates.info()  # info() returns None, so call it separately to keep df_dates a DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 816 entries, 0 to 815
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   id               816 non-null    int64         
 1   city             803 non-null    object        
 2   date             816 non-null    datetime64[ns]
 3   player_of_match  812 non-null    object        
 4   venue            816 non-null    object        
 5   neutral_venue    816 non-null    int64         
 6   team1            816 non-null    object        
 7   team2            816 non-null    object        
 8   toss_winner      816 non-null    object        
 9   toss_decision    816 non-null    object        
 10  winner           812 non-null    object        
 11  result           812 non-null    object        
 12  result_margin    799 non-null    float64       
 13  eliminator       812 non-null    object        
 14  method           19 non-null     object   

The datatype of the `date` column has changed from `object` to `datetime64[ns]`.
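
Once the column is a datetime, the `.dt` accessor becomes available. A minimal sketch on hypothetical inline records with ISO-formatted dates (the rows are illustrative, not taken from the IPL file):

```python
from io import StringIO

import pandas as pd

# Hypothetical match records with ISO-formatted dates
csv_text = "id,date,city\n1,2008-04-18,Bangalore\n2,2008-04-19,Chandigarh\n"

df_parsed = pd.read_csv(StringIO(csv_text), parse_dates=["date"])

# A datetime column exposes date parts through the .dt accessor
print(df_parsed["date"].dt.year.tolist())
print(df_parsed["date"].dt.day_name().tolist())
```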

## Calling a function on a column while reading
`converters`: this parameter maps a column name to a `function`. pandas calls the function on each value of that column while reading, so the `data manipulation` happens as the file is loaded.

In [24]:
df= pd.read_csv('../Dataset/placement.csv')
df.head(5)

Unnamed: 0.1,Unnamed: 0,cgpa,iq,placement
0,0,6.8,123.0,1
1,1,5.9,106.0,0
2,2,5.3,121.0,0
3,3,7.4,132.0,1
4,4,5.8,142.0,0


In [22]:
def pd_fun(column_data):
    '''Multiply the value by 10.'''
    return float(column_data) * 10

In [23]:
df_fun = pd.read_csv('../Dataset/placement.csv', converters={'cgpa': pd_fun})
df_fun.head(5)

Unnamed: 0.1,Unnamed: 0,cgpa,iq,placement
0,0,68.0,123.0,1
1,1,59.0,106.0,0
2,2,53.0,121.0,0
3,3,74.0,132.0,1
4,4,58.0,142.0,0


The `cgpa` values have been multiplied by 10.
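
A converter receives each cell as a raw string, which is why `pd_fun` calls `float()` first. That also makes converters handy for cleaning. The sketch below assumes a hypothetical `score` column that carries a percent sign; the function name and data are invented for illustration.

```python
from io import StringIO

import pandas as pd


def parse_percent(raw_value):
    """Turn a string such as '68%' into the float 68.0."""
    return float(raw_value.strip().rstrip("%"))


# Hypothetical data in which the score column carries a percent sign
csv_text = "student,score\nA, 68%\nB,59%\n"

df_scores = pd.read_csv(StringIO(csv_text), converters={"score": parse_percent})
print(df_scores["score"].tolist())
```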