# Combine datasets

There are several ways to combine multiple CSV files in multiple folders. 

Here are a few options:

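- Using the command line: You can use the “cat” command to concatenate all of the CSV files in a given directory and its subdirectories. For example, to combine all CSV files in the current directory and its subdirectories, you would use the command: “cat **/*.csv > combined.csv”. In bash, the ** pattern only recurses into subdirectories when the globstar option is enabled (shopt -s globstar). Note also that cat copies every file's header row into the output, so this works best for files without headers.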
- Using Python: You can use the pandas library in Python to read in multiple CSV files and concatenate them into a single DataFrame. You can then export the DataFrame to a new CSV file.
- Using R: You can use the read.csv() function in R to read in multiple CSV files and bind them together using the rbind() function. You can then write the combined data frame to a new CSV file using write.csv().
- Using Excel or Google Sheets: You can open each CSV file in Excel or Google Sheets and copy and paste the data into a new sheet or workbook. Then you can save the combined data as a new CSV file.

Which option is best depends on your skills and on the size of your files.

You can use the pandas library in Python to combine multiple CSV files. Here is an example that combines every CSV file in a single folder:

```python
import os
import pandas as pd
path = '/Combiner_Files'  # replace with the path to the folder containing the source CSV files
all_files = os.listdir(path)
csv_files = [file for file in all_files if file.endswith('.csv')]
df_list = []
for file in csv_files:
    full_path = os.path.join(path, file)
    df = pd.read_csv(full_path)
    df_list.append(df)
combined_df = pd.concat(df_list)
combined_df.to_csv('combined.csv', index=False)
```

This code goes through all the CSV files in the folder, reads each one into a DataFrame, and concatenates them into a single DataFrame. The resulting DataFrame is then saved to a new CSV file called “combined.csv” in the current working directory.

You can also use the glob module to find all the CSV files in subfolders recursively. To do that, add import glob at the top and replace

```python
os.listdir(path) 
```

with 

```python
glob.glob(path + '/**/*.csv', recursive=True)
```

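Please note that you need to replace '/Combiner_Files' in the code above with the actual path to the folder containing the CSV files. Also note that glob.glob() returns only paths ending in .csv, and each one already includes the folder, so the endswith filter and the os.path.join() call are no longer needed: pass each result straight to pd.read_csv().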
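The recursive variant can be sketched end to end as follows. This is a minimal sketch, not the notebook's own code: the helper name `combine_csvs_recursive` and the demo files created in a temporary folder are assumptions made so the example runs anywhere.

```python
import glob
import os
import tempfile

import pandas as pd


def combine_csvs_recursive(root, output_path):
    """Concatenate every CSV under root (including subfolders) into one file."""
    pattern = os.path.join(root, '**', '*.csv')
    # glob returns paths that already start with root, so no os.path.join is needed later
    csv_files = sorted(glob.glob(pattern, recursive=True))
    # Skip the output file in case it sits under root from an earlier run
    csv_files = [f for f in csv_files
                 if os.path.abspath(f) != os.path.abspath(output_path)]
    frames = [pd.read_csv(f) for f in csv_files]
    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv(output_path, index=False)
    return combined


# Hypothetical demo data: one CSV at the top level and one in a subfolder
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'sub'))
    pd.DataFrame({'uniqueID': [1, 2]}).to_csv(os.path.join(root, 'a.csv'), index=False)
    pd.DataFrame({'uniqueID': [3]}).to_csv(os.path.join(root, 'sub', 'b.csv'), index=False)
    result = combine_csvs_recursive(root, os.path.join(root, 'combined.csv'))

print(result)
```

Sorting the file list keeps the row order stable between runs, since glob does not promise any particular order.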



Running it on the Combiner_Files folder:

```python
import os
import pandas as pd
path = 'Combiner_Files'  # replace with the path to the folder containing the source CSV files
output_name = 'combined.csv'
all_files = os.listdir(path)
# Skip the output file so that re-running the cell does not read its own result
csv_files = [file for file in all_files if file.endswith('.csv') and file != output_name]
df_list = []
for file in csv_files:
    full_path = os.path.join(path, file)
    df = pd.read_csv(full_path)
    df_list.append(df)
combined_df = pd.concat(df_list)
combined_df.to_csv(os.path.join(path, output_name), index=False)
```

### Notice that this combines the files, but it isn't really what we wanted

You can use the pd.concat() function in pandas to combine two files, each with specified columns. Here’s an example of how you can do this:

```python
import pandas as pd

# Read the first file into a DataFrame and select specific columns
df1 = pd.read_csv('file1.csv', usecols=['column1', 'column2'])

# Read the second file into a DataFrame and select specific columns
df2 = pd.read_csv('file2.csv', usecols=['column3', 'column4'])

# Combine the two DataFrames using pd.concat
combined_df = pd.concat([df1, df2], axis=1)
print(combined_df)
```

In this example, pd.read_csv() reads only the columns ‘column1’ and ‘column2’ from the first file and only ‘column3’ and ‘column4’ from the second. pd.concat() then places the two DataFrames side by side (axis=1), so the result contains only the selected columns from both files. Rows are matched by index, which for freshly read files means by position, so both files must list their records in the same order.
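To see why row order matters with axis=1, here is a small sketch. The in-memory CSV contents and the names in them are made up for illustration; they stand in for two files that describe the same people in a different order.

```python
import io

import pandas as pd

# Hypothetical file contents standing in for file1.csv and file2.csv
names_csv = io.StringIO("uniqueID,FirstName\n1,Ann\n2,Bob\n")
# Same people, but listed in the opposite order
addresses_csv = io.StringIO("uniqueID,TownCity\n2,Boston\n1,Austin\n")

df1 = pd.read_csv(names_csv)
df2 = pd.read_csv(addresses_csv, usecols=['TownCity'])

# Rows are paired by position: row 0 with row 0, row 1 with row 1
side_by_side = pd.concat([df1, df2], axis=1)
print(side_by_side)
```

Here the first row pairs Ann (uniqueID 1) with Boston, which belongs to uniqueID 2. When the files are not guaranteed to be in the same order, joining on a shared key column is the safer choice.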

```python
import pandas as pd

# Read the first file into a DataFrame and select specific columns
df1 = pd.read_csv('Combiner_Files/file1.csv', usecols=['uniqueID', 'FirstName', 'LastName'])

# Read the second file into a DataFrame and select specific columns
df2 = pd.read_csv('Combiner_Files/file2.csv', usecols=['StreetNumber', 'StreetName', 'TownCity', 'State'])

# Combine the two DataFrames using pd.concat
combined_df = pd.concat([df1, df2], axis=1)
combined_df.to_csv('Combiner_Files/combined2.csv', index=False)
```


### This gets very close to what we wanted

You can also use the merge function to merge the two files based on a common column:

```python
combined_df = pd.merge(df1, df2, on='common_column') 
```

This merges the two DataFrames on the column ‘common_column’. By default pd.merge() performs an inner join, so the result contains all columns from both DataFrames, but only the rows whose ‘common_column’ value appears in both.
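The effect of the join type is easiest to see on a tiny example. The DataFrames below are made-up sample data; `how` and `indicator` are standard pd.merge() parameters.

```python
import pandas as pd

# Hypothetical sample data: uniqueID 1 has no address, uniqueID 4 has no name
df1 = pd.DataFrame({'uniqueID': [1, 2, 3], 'FirstName': ['Ann', 'Bob', 'Cy']})
df2 = pd.DataFrame({'uniqueID': [2, 3, 4], 'TownCity': ['Boston', 'Chicago', 'Denver']})

# Default inner join: only IDs present in both DataFrames survive
inner = pd.merge(df1, df2, on='uniqueID')

# Outer join keeps every ID; indicator=True adds a _merge column showing the source
outer = pd.merge(df1, df2, on='uniqueID', how='outer', indicator=True)

print(inner)
print(outer)
```

Checking the `_merge` column of an outer join is a quick way to find records that would silently disappear from the inner join.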

```python
import pandas as pd

# Read the first file into a DataFrame
df1 = pd.read_csv('Combiner_Files/file1.csv')

# Read the second file into a DataFrame
df2 = pd.read_csv('Combiner_Files/file2.csv')

# Combine the two DataFrames using pd.merge and use 'uniqueID'
combined_df = pd.merge(df1, df2, on='uniqueID')
combined_df.to_csv('Combiner_Files/combined3.csv', index=False)
```

### This does the combine a little more elegantly

Say you had two differently named files and wanted to bring in only specific files and specific columns.

You can use the glob module in Python to find files that match a specific pattern and then use regular expressions (regex) to identify the specific files you want to use as Name.csv and Address.csv.

Here’s an example of how you can do this:

```python
import glob
import re

import pandas as pd

# Use glob to find all files that match a specific pattern
files = glob.glob('Combiner_Files/*.csv')

# Use a regular expression to identify the specific files you want to use
name_file = next(f for f in files if re.search(r'Name', f))
address_file = next(f for f in files if re.search(r'Address', f))

# Read the files into DataFrames
df_name = pd.read_csv(name_file)
df_address = pd.read_csv(address_file)

# Perform operations on the DataFrames
...
```

In this example, the glob.glob() function is used to find all .csv files in the Combiner_Files directory. Then the next() function is used together with a regular expression to pick the first file whose path contains Name and the first whose path contains Address, and to assign them to the variables name_file and address_file respectively. Finally, the two files are read into DataFrames using the pd.read_csv() function.

You can also use the re.search() function with a more specific pattern. For example, if the files are named ‘Name_2023_01_01.csv’ and ‘Address_2023_01_01.csv’, you can use:

```python
name_file = next(f for f in files if re.search(r'Name_2023_01_01', f))
address_file = next(f for f in files if re.search(r'Address_2023_01_01', f))
```

This will assign the Name_2023_01_01.csv and Address_2023_01_01.csv to the variables name_file and address_file respectively.

Note that if no file in the directory matches the pattern, next() raises a StopIteration exception. You should handle this in your code, for example with a try-except block.
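One way to deal with a missing file is sketched below. Instead of a try-except block it uses the optional default argument of next(), which is returned when nothing matches. The helper name `find_file` and the file listing are hypothetical; the helper also matches against the file name only, so a folder name that happens to contain the pattern cannot cause a false match.

```python
import os
import re


def find_file(files, pattern):
    """Return the first path whose file name matches pattern, or None."""
    matches = (f for f in files if re.search(pattern, os.path.basename(f)))
    # The second argument to next() is returned instead of raising StopIteration
    return next(matches, None)


# Hypothetical listing, standing in for glob.glob('Combiner_Files/*.csv')
files = ['Combiner_Files/Name_2023_01_01.csv', 'Combiner_Files/Other.csv']

name_file = find_file(files, r'Name')
address_file = find_file(files, r'Address')

if address_file is None:
    print('No address file found; skipping the merge')
```

Returning None keeps the decision about what to do next (skip, warn, or stop) with the calling code.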

```python
import glob
import re

import pandas as pd

# Use glob to find all files that match a specific pattern
files = glob.glob('Combiner_Files/*.csv')

# Use a regular expression to identify the specific files you want to use
name_file = next(f for f in files if re.search(r'Name', f))
address_file = next(f for f in files if re.search(r'Address', f))

# Read the files into DataFrames
df_name = pd.read_csv(name_file)
df_address = pd.read_csv(address_file)

# Combine the two DataFrames using pd.merge and use 'uniqueID'
combined_df = pd.merge(df_name, df_address, on='uniqueID')
combined_df.to_csv('Combiner_Files/combined-Name-Address.csv', index=False)
```