# Intro to Data Science

In [1]:
import pandas as pd

## Steps 1 & 2
### Question
We want to know which features affect how long a UFO is seen in the sky.
### Data Acquisition
The data we are working with contains information on UFO sightings around the world.
This data was collected from Kaggle.com, a popular free website that hosts numerous free datasets for analysis. Link to dataset: https://www.kaggle.com/datasets/NUFORC/ufo-sightings
## Read in csv file
A CSV (comma-separated values) file stores data much like an Excel spreadsheet does. In general terms, it holds a single table, similar to a table in a relational database: a fixed set of columns, with one record per row.

Once the data from the CSV is read into Python, we store it in an object called a dataframe. Many people also refer to it as a pandas dataframe because pandas is the package that provides the dataframe functionality.


In [35]:
#how to read in a csv file
ufo = pd.read_csv("data/alien_df.csv")

#a way to quickly view the first 5 entries in the dataframe
ufo.head()

Unnamed: 0,city,state,country,shape,duration_seconds,date.posted,latitude,longitude
0,san marcos,tx,us,cylinder,2700,4/27/2004,29.883056,-97.941111
1,lackland afb,tx,,light,7200,12/16/2005,29.38421,-98.581082
2,chester (uk/england),,gb,circle,20,1/21/2008,53.2,-2.916667
3,edna,tx,us,circle,20,1/17/2004,28.978333,-96.645833
4,kaneohe,hi,us,light,900,1/22/2004,21.418056,-157.803611
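
To see what `pd.read_csv` does without needing `data/alien_df.csv` on disk, here is a minimal sketch that parses CSV text held in memory. The rows are copied from the output above; the variable names `csv_text` and `sample`, and the use of `io.StringIO` as a stand-in for a file, are assumptions made for illustration.

```python
import io

import pandas as pd

# Three rows copied from the output above, written as raw CSV text.
# Empty fields (two commas in a row) are read as missing values (NaN).
csv_text = """city,state,country,shape,duration_seconds,date.posted,latitude,longitude
san marcos,tx,us,cylinder,2700,4/27/2004,29.883056,-97.941111
lackland afb,tx,,light,7200,12/16/2005,29.38421,-98.581082
chester (uk/england),,gb,circle,20,1/21/2008,53.2,-2.916667
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # the type pandas inferred for each column
```

The first line of the text becomes the column names, and pandas adds the row labels 0, 1, 2 on its own.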


## Dataframe Structure

In [36]:
# what features (columns) are present in the dataset
ufo.columns

# How many rows and columns are there?
ufo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88870 entries, 0 to 88869
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   city              88674 non-null  object 
 1   state             81353 non-null  object 
 2   country           76311 non-null  object 
 3   shape             85754 non-null  object 
 4   duration_seconds  88870 non-null  int64  
 5   date.posted       88870 non-null  object 
 6   latitude          88673 non-null  float64
 7   longitude         88870 non-null  float64
dtypes: float64(2), int64(1), object(5)
memory usage: 5.4+ MB
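
`info()` reports non-null counts, so the number of missing values in a column is the row count minus that figure (for `city`, 88870 - 88674 = 196). A small sketch of getting the same numbers directly, using a made-up three-row dataframe named `sample` rather than the full dataset:

```python
import pandas as pd

# A tiny stand-in for the UFO data, with two deliberately missing values.
sample = pd.DataFrame({
    "city": ["san marcos", "lackland afb", "chester (uk/england)"],
    "state": ["tx", "tx", None],
    "country": ["us", None, "gb"],
    "duration_seconds": [2700, 7200, 20],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = sample.shape

# isna() marks each missing cell as True; summing counts them per column.
missing_per_column = sample.isna().sum()

print(n_rows, n_cols)
print(missing_per_column)
```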


In [37]:
# view the first n rows (here n = 10)
ufo.head(10)

# view the last n rows (n defaults to 5 when omitted)
ufo.tail()

Unnamed: 0,city,state,country,shape,duration_seconds,date.posted,latitude,longitude
88865,napa,ca,us,other,1200,9/30/2013,38.297222,-122.284444
88866,vienna,va,us,circle,5,9/30/2013,38.901111,-77.265556
88867,edmond,ok,us,cigar,1020,9/30/2013,35.652778,-97.477778
88868,starr,sc,us,diamond,0,9/30/2013,34.376944,-82.695833
88869,ft. lauderdale,fl,us,oval,0,9/30/2013,26.121944,-80.143611


You could almost think of a dataframe as a dictionary. A dictionary is a built-in Python data structure that maps keys to values; a dataframe behaves similarly, mapping column names to columns, while giving us more functionality (i.e. more ways to manipulate and view our data).


In [38]:
#access specific columns
ufo.city
ufo["city"]

0                  san marcos
1                lackland afb
2        chester (uk/england)
3                        edna
4                     kaneohe
                 ...         
88865                    napa
88866                  vienna
88867                  edmond
88868                   starr
88869          ft. lauderdale
Name: city, Length: 88870, dtype: object
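
The dictionary analogy can be made concrete with a short sketch. The dataframe below is built from an ordinary dictionary; the names `columns` and `sample` are made up for illustration, and the values are taken from rows shown earlier.

```python
import pandas as pd

# A plain dictionary: each key is a column name, each value a list of entries.
columns = {
    "city": ["san marcos", "edna", "kaneohe"],
    "duration_seconds": [2700, 20, 900],
}
sample = pd.DataFrame(columns)

# Like a dictionary, a dataframe is looked up by key (the column name).
cities = sample["city"]

print(list(sample.keys()))  # column names, like dictionary keys
print("city" in sample)     # membership test checks the column names
print(cities.tolist())
```

Attribute access (`sample.city`) returns the same column as `sample["city"]`, but it only works when the column name is a valid Python identifier and does not clash with an existing dataframe attribute.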

# Data Cleaning
Notice how one of the columns (date.posted) uses a dot (.) as a separator while another (duration_seconds) uses an underscore (_). Let's clean up our data a bit so that the column naming convention is uniform. A dot in a column name also prevents attribute-style access to that column.

In [42]:
ufo.columns
ufo = ufo.rename(columns={'date.posted':'date_posted'}) #notice the dictionary notation
ufo.columns

Index(['city', 'state', 'country', 'shape', 'duration_seconds', 'date_posted',
       'latitude', 'longitude'],
      dtype='object')
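
When several columns need the same fix, `rename` also accepts a function in place of a dictionary. A minimal sketch, assuming a made-up dataframe `sample`; the column `sighting.shape` is hypothetical and does not appear in the UFO dataset.

```python
import pandas as pd

# Two of these column names use a dot; "sighting.shape" is invented for the example.
sample = pd.DataFrame({
    "date.posted": ["4/27/2004"],
    "duration_seconds": [2700],
    "sighting.shape": ["cylinder"],
})

# The function is applied to every column name; rename returns a new dataframe.
renamed = sample.rename(columns=lambda name: name.replace(".", "_"))

print(list(renamed.columns))
```

As in the cell above, the result has to be assigned to a variable, because `rename` leaves the original dataframe unchanged.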

# Data Manipulation 
Rearranging your data in ways that help you understand its underlying patterns.
## Filtering
Filtering lets you subset your data, that is, keep only the data points that meet certain criteria.


In [43]:
ufo.head()

Unnamed: 0,city,state,country,shape,duration_seconds,date_posted,latitude,longitude
0,san marcos,tx,us,cylinder,2700,4/27/2004,29.883056,-97.941111
1,lackland afb,tx,,light,7200,12/16/2005,29.38421,-98.581082
2,chester (uk/england),,gb,circle,20,1/21/2008,53.2,-2.916667
3,edna,tx,us,circle,20,1/17/2004,28.978333,-96.645833
4,kaneohe,hi,us,light,900,1/22/2004,21.418056,-157.803611


In [44]:
# which ufos were in the air longer than 900 seconds?
# .loc with a boolean condition keeps only the rows where the condition is True
ufo.loc[ufo['duration_seconds'] > 900]

Unnamed: 0,city,state,country,shape,duration_seconds,date_posted,latitude,longitude
0,san marcos,tx,us,cylinder,2700,4/27/2004,29.883056,-97.941111
1,lackland afb,tx,,light,7200,12/16/2005,29.384210,-98.581082
7,norwalk,ct,us,disk,1200,10/2/1999,41.117500,-73.408333
12,bellmore,ny,us,disk,1800,5/11/2000,40.668611,-73.527500
15,harlan county,ky,us,circle,1200,9/15/2005,36.843056,-83.321944
...,...,...,...,...,...,...,...,...
88840,new york city (brooklyn),ny,us,light,1290,9/24/2012,40.714167,-74.006389
88854,clifton,nj,,other,3600,9/30/2013,40.858433,-74.163755
88863,boise,id,us,circle,1200,9/30/2013,43.613611,-116.202500
88865,napa,ca,us,other,1200,9/30/2013,38.297222,-122.284444
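
To get a count rather than a table of rows, the boolean condition itself can be summed, since each `True` counts as 1. A small sketch on a made-up dataframe `sample` whose values come from the first five rows shown earlier:

```python
import pandas as pd

sample = pd.DataFrame({
    "city": ["san marcos", "lackland afb", "chester (uk/england)", "edna", "kaneohe"],
    "duration_seconds": [2700, 7200, 20, 20, 900],
})

# The comparison produces one True/False value per row.
mask = sample["duration_seconds"] > 900

# Passing the mask to .loc keeps the rows where it is True.
long_sightings = sample.loc[mask]

# Summing the mask counts the True values.
count = int(mask.sum())

print(long_sightings)
print(count)
```

Note that the 900-second sighting is excluded, because `>` is a strict comparison.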


In [48]:
#give me all the ufos that were seen in New York
ufo.loc[ufo['state'] == "ny"]

#give me all the ufos that were in the sky for more than 900 but fewer than 1000 seconds
ufo.loc[(ufo['duration_seconds'] > 900) & (ufo['duration_seconds'] < 1000)].head()

Unnamed: 0,city,state,country,shape,duration_seconds,date_posted,latitude,longitude
3742,stockton,ca,us,light,960,10/23/2013,37.957778,-121.289722
6070,lincoln,ne,us,other,960,12/3/2004,40.8,-96.666667
6389,matteson,il,us,,960,11/2/2004,41.503889,-87.713056
7758,tyler,tx,us,flash,960,10/19/1999,32.351111,-95.300833
8928,indianapolis,in,us,disk,960,2/7/2014,39.768333,-86.158056
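
In the range filter above, each comparison sits in its own parentheses because `&` binds more tightly than `>` and `<` in Python. A sketch of an equivalent form using `Series.between` follows; the dataframe `sample` is made up for illustration, and the string value passed to `inclusive` assumes pandas 1.3 or newer.

```python
import pandas as pd

sample = pd.DataFrame({
    "city": ["kaneohe", "stockton", "lincoln", "norwalk"],
    "duration_seconds": [900, 960, 960, 1200],
})

# Explicit form, as used above: both comparisons are strict.
explicit = sample.loc[
    (sample["duration_seconds"] > 900) & (sample["duration_seconds"] < 1000)
]

# Equivalent form: inclusive="neither" excludes both endpoints.
with_between = sample.loc[
    sample["duration_seconds"].between(900, 1000, inclusive="neither")
]

print(with_between)
```

By default `between` includes both endpoints, so leaving out the `inclusive` argument would also keep a sighting of exactly 900 seconds.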
